C# scraping HTML/.asp

Posted on 2024-08-16 14:52:59

I'm working on a "personal can-it-work" sort of project, and I have everything working great except for parsing some information from a .asp source file into my program.

This is the parsing code I have so far:

 // parse out the results
            try 
            {
                int snr_start = result.IndexOf("SNR");
                int snr_end = result.IndexOf("</TR>", snr_start);
                snr = result.Substring(snr_start, snr_end - snr_start);
                snr = snr.Substring(snr.IndexOf("<TD>") + 1);
                snr = snr.Substring(0, snr.Length - 6);
                iSNR = Convert.ToInt32(snr.Substring(0, snr.IndexOf(" ")));

                int dnpwr_start = result.IndexOf("Downstream Power", snr_end);
                int dnpwr_stop = result.IndexOf("</TR>", dnpwr_start);
                dnpwr = result.Substring(dnpwr_start, dnpwr_stop - dnpwr_start);
                dnpwr = dnpwr.Substring(dnpwr.IndexOf("<TD>") + 1);
                dnpwr = dnpwr.Substring(0, dnpwr.IndexOf("<TABLE") - 1);
                iDPWR = Convert.ToInt32(dnpwr.Substring(0, dnpwr.IndexOf(" ")));

                int uppwr_start = result.IndexOf("Upstream Power", dnpwr_stop);
                int uppwr_stop = result.IndexOf("</TR>", uppwr_start);
                uppwr = result.Substring(uppwr_start, uppwr_stop - uppwr_start);
                uppwr = uppwr.Substring(uppwr.IndexOf("<TD>") + 1);
                uppwr = uppwr.Substring(0, uppwr.IndexOf("</TD>") - 1);
                iUPWR = Convert.ToInt32(uppwr.Substring(0, uppwr.IndexOf(" ")));
            }
            catch 

And this is the source file with the information I'm trying to scrape from it (SNR, Downstream Power, Upstream Power):

<td class="headerR">Downstream Power</td>
<td class="contentL">1.0 dBmV</td>
</tr>
<tr>
<td class="headerR">SNR</td>
<td class="contentL">39.656 dB</td>
</tr>
<tr>
<td class="headerR">Upstream Power</td>
<td class="contentL">42.0 dBmV</td>
</tr>

I'm not too sure where I'm going wrong, but any help would be greatly appreciated.
The focus of the project is to parse the signal levels off of my modem (I'm an MSO employee) for extended monitoring. If needed, I can post the full source from the .asp page.

Thanks,
Matt

Comments (3)

谜泪 2024-08-23 14:53:00

This should work if you only want to pull the data from one table:

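// Requires: using System.Globalization; using System.Xml;
int start = result.IndexOf("<table>");
int end = result.IndexOf("</table>", start) + 8;
var doc = new XmlDocument();
doc.LoadXml(result.Substring(start, end - start)); // throws if the fragment is not well-formed XML

// Each value cell is the sibling that follows its header cell.
// The readings are decimals ("39.656 dB"), so Convert.ToInt32 would throw a FormatException;
// iSNR, iDPWR and iUPWR are assumed to be declared as double here.
iSNR = Convert.ToDouble(doc.SelectSingleNode("//td[text() = 'SNR']/following-sibling::td").InnerText.Split(' ')[0], CultureInfo.InvariantCulture);
iDPWR = Convert.ToDouble(doc.SelectSingleNode("//td[text() = 'Downstream Power']/following-sibling::td").InnerText.Split(' ')[0], CultureInfo.InvariantCulture);
iUPWR = Convert.ToDouble(doc.SelectSingleNode("//td[text() = 'Upstream Power']/following-sibling::td").InnerText.Split(' ')[0], CultureInfo.InvariantCulture);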
瑕疵 2024-08-23 14:52:59

I am not too keen on using those string methods for screen scraping unless it's your last resort.

You can try using a regex or, even better, if you can guarantee that your HTML source is well formed (XHTML), you could load it (or the snippet of XML you want) into an XML document object and use either XPath or, if you are on .NET 3.5, LINQ to XML (XLinq).
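
As a rough illustration of the LINQ to XML route, here is a minimal sketch. It assumes the table fragment has already been cut out of the page, is well-formed XML with lowercase tag names and no XML namespace, and that every row of interest has exactly two cells. The class and method names (SignalTableParser, Parse) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class SignalTableParser
{
    // Hypothetical helper: maps each header cell's text to the numeric
    // part of the value cell next to it, e.g. "SNR" -> 39.656.
    public static Dictionary<string, double> Parse(string xhtmlFragment)
    {
        // XElement.Parse throws an XmlException if the fragment is not well-formed XML.
        XElement table = XElement.Parse(xhtmlFragment);

        return table.Descendants("tr")
            .Select(row => row.Elements("td").ToList())
            .Where(cells => cells.Count == 2) // skip rows that are not header/value pairs
            .ToDictionary(
                cells => cells[0].Value.Trim(),
                // "39.656 dB" -> 39.656; invariant culture so '.' is always the decimal separator.
                cells => double.Parse(
                    cells[1].Value.Trim().Split(' ')[0],
                    CultureInfo.InvariantCulture));
    }
}
```

Calling SignalTableParser.Parse on the extracted table and indexing the result with "SNR", "Downstream Power" or "Upstream Power" would then give the three readings as doubles. If the modem page turns out not to be well-formed XML, this approach fails at the parse step and a more tolerant method is needed.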

相权↑美人 2024-08-23 14:52:59

ASP source? Your best bet is probably a regular expression; they're designed for this kind of task. Any kind of scraping usually means it will be worth your while to dig into them.

What language are you using to parse it? If .NET, you can get your name/value pairs easily with the Regex class.

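Something like this for the regex (written as a C# verbatim string, so embedded quotes are doubled, and with + quantifiers so each group captures the whole cell text rather than a single character):

@"<tr>\s*<td\s+class\s*=\s*""headerR""\s*>\s*(?<name>[^<]+?)\s*</td\s*>\s*<td\s+class\s*=\s*""contentL""\s*>\s*(?<value>[^<]+?)\s*</td\s*>\s*</tr\s*>"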

Then, you can loop through the matches and get your list of name/value pairs:

"Downstream Power":"1.0 dBmV"
"SNR":"39.656 dB"
"Upstream Power":"42.0 dBmV"

Should be straightforward.
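
A minimal sketch of that loop, assuming the rows look like the sample in the question (one headerR cell followed by one contentL cell). It uses a variant of the pattern with +? quantifiers so the named groups capture the full cell text, and RegexOptions.IgnoreCase so upper- and lowercase tags both match. The class and method names (SignalTableRegex, Parse) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SignalTableRegex
{
    // Verbatim string: backslashes are literal and embedded quotes are doubled.
    // [^<]+? is lazy so the following \s* absorbs any trailing whitespace.
    private static readonly Regex RowPattern = new Regex(
        @"<tr>\s*<td\s+class\s*=\s*""headerR""\s*>\s*(?<name>[^<]+?)\s*</td\s*>\s*" +
        @"<td\s+class\s*=\s*""contentL""\s*>\s*(?<value>[^<]+?)\s*</td\s*>\s*</tr\s*>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns header text -> value text, e.g. "SNR" -> "39.656 dB".
    public static Dictionary<string, string> Parse(string html)
    {
        var pairs = new Dictionary<string, string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            // If a header appears more than once, the last occurrence wins.
            pairs[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return pairs;
    }
}
```

The values come back as text with their units attached, so turning "39.656 dB" into a number is still a separate step, for example by splitting on the space and parsing the first part with the invariant culture.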
