Converting a web page to plain text
I am trying to convert a web page into plain text. When the page contains a table, I get the td and tr tags in the output too. If I strip the table tags with a regex, I lose some of the content.
Here is my code:
string s = Regex.Replace(htmldoc, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<!--.*?-->", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<style.*?style>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<a.*?a>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<img.*?img>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<table.*?table>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
s = doc.DocumentNode.SelectSingleNode("//body").InnerText.Trim();
Please take a look and tell me how I can get the contents of a table without getting the td and tr tags.
Comments (2)
If you are using HTML Agility Pack to parse the table, you don't need to remove the HTML tags with regular expressions. There are some good examples of parsing tables with HTML Agility Pack here on SO, for example "HTML Agility pack - parsing tables".
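As a rough illustration of that suggestion, the sketch below reads the text of every table cell through HTML Agility Pack instead of deleting the table with a regex (the `<table.*?table>` replacement in the question removes each table together with its text). The class name `TableText`, the method name `ExtractTables`, and the tab/newline output layout are assumptions made for this example and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableText
{
    // Hypothetical helper: returns one line per table row,
    // with the cell texts of that row separated by tabs.
    public static string ExtractTables(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return string.Empty;

        var lines = new List<string>();
        foreach (var row in rows)
        {
            // Keep th and td cells in document order.
            var cells = row.ChildNodes
                .Where(n => n.Name == "td" || n.Name == "th")
                // InnerText drops the markup; DeEntitize decodes entities such as &amp;.
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim());

            lines.Add(string.Join("\t", cells));
        }
        return string.Join("\n", lines);
    }
}
```

Note that with nested tables the text of an inner table also shows up inside the outer cell that contains it, so this sketch is only suited to simple, flat tables.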
You can use the body's InnerText. Next, you may want to collapse spaces and new lines.
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
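The InnerText approach described above could look roughly like the following sketch. The names `PlainText` and `FromHtml` are hypothetical, and replacing each br element with a space before reading InnerText is an added workaround for the hello<br>world case only; it is not from the original answer and does nothing for layout that depends on CSS.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class PlainText
{
    // Hypothetical helper: body text with whitespace collapsed to single spaces.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Drop nodes whose content should not appear in the text output.
        // SelectNodes returns null when nothing matches, hence the null checks.
        var junk = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (junk != null)
            foreach (var node in junk.ToList()) node.Remove();

        // Assumption: a <br> should separate words, so swap it for a space.
        var breaks = doc.DocumentNode.SelectNodes("//br");
        if (breaks != null)
            foreach (var br in breaks.ToList())
                br.ParentNode.ReplaceChild(doc.CreateTextNode(" "), br);

        // Fall back to the whole document when there is no <body> element.
        var root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Decode entities such as &amp;, then collapse spaces and new lines.
        var text = WebUtility.HtmlDecode(root.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Because the table markup is never deleted beforehand, the text inside td and th cells is kept, while the tags themselves do not appear in the result.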