How can I extract text from reasonably sane HTML?
My question is sort of like this question, but I have more constraints:
- I know the documents are reasonably sane,
- they are very regular (they all came from the same source),
- I want about 99% of the visible text,
- about 99% of what is viable at all is text (they are more or less RTF converted to HTML),
- I don't care about formatting or even paragraph breaks.
Are there any tools set up to do this, or am I better off just breaking out RegexBuddy and C#?
I'm open to command-line or batch-processing tools as well as C/C#/D libraries.
Comments (10)
I hacked up this code today with HTML Agility Pack; it extracts unformatted, trimmed text.
If you want to maintain some level of formatting, you can build on the sample provided with the source.
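The answerer's snippet is not reproduced on this page. As a rough stand-in (not the original code), here is a minimal sketch of extracting unformatted, trimmed text with HTML Agility Pack. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class name `PlainTextExtractor` is hypothetical.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper, not the answerer's original code.
public static class PlainTextExtractor
{
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Only text nodes carry visible text.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Skip the contents of <script> and <style>, which are not visible.
            string parent = node.ParentNode?.Name;
            if (parent == "script" || parent == "style") continue;

            // Decode entities such as &amp; and trim surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length == 0) continue;

            if (sb.Length > 0) sb.Append(' ');
            sb.Append(text);
        }
        return sb.ToString();
    }
}
```

Joining fragments with a single space throws away paragraph breaks, which matches the question's stated requirements; swap the separator for a newline if you want one fragment per line.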
You can use NUglify, which supports text extraction from HTML:
Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regular expressions involved, just a pure recursive descent parser).
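The NUglify snippet is missing from this page. A minimal sketch of the call, assuming the `NUglify` NuGet package is referenced and that `Uglify.HtmlToText` returns a result whose `Code` property holds the extracted text (as the project's README describes); the wrapper class name is hypothetical.

```csharp
using System;
using NUglify; // third-party NuGet package, assumed to be installed

// Hypothetical wrapper around NUglify's text extraction.
public static class NUglifyTextExample
{
    public static string Extract(string html)
    {
        // HtmlToText parses the markup and returns only its text content.
        var result = Uglify.HtmlToText(html);
        return result.Code;
    }
}
```

For example, `NUglifyTextExample.Extract("<div>  <p>This is <em>   a text    </em></p>   </div>")` should return the sentence without any markup.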
You need to use the HTML Agility Pack.
You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
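To illustrate the HTML Agility Pack suggestion (LINQ plus `Descendants`, then `InnerText`), here is a small sketch. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper illustrating Descendants + InnerText.
public static class DescendantsExample
{
    // Returns the text of the first <tagName> element, or null if there is none.
    public static string FirstInnerText(string html, string tagName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants(name) enumerates every matching element in document order.
        HtmlNode node = doc.DocumentNode.Descendants(tagName).FirstOrDefault();

        // InnerText concatenates the text of the element and all its children.
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Passing `"body"` as the tag name returns the text of the whole page body, which may be all that a regular, mostly-text document needs.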
Here is the code I am using:
It's relatively simple: load the HTML in C# using mshtml.dll or the WebBrowser control in C#/WinForms, and you can then treat the entire HTML document as a tree and traverse it, capturing the InnerText of each node.
Or you could use document.all, which flattens the tree into a collection that you can iterate over, again capturing the InnerText.
Here's an example:
Hope that helps!
Here is the best way:
Here's a class I developed to accomplish the same thing. All the available HTML parsing libraries were far too slow, and regex was far too slow as well. The functionality is explained in the code comments. In my benchmarks, this code is a little over 10x faster than HTML Agility Pack's equivalent code when tested on Amazon's landing page (included below).
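The answerer's class is not reproduced on this page. To show the general idea of a hand-rolled, single-pass extractor that avoids both a DOM and regular expressions, here is a minimal sketch using only the .NET base class library. It is not the answerer's code and makes no speed claims; `TagStripper` and its members are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical single-pass tag stripper; a sketch, not the answerer's class.
public static class TagStripper
{
    public static string Strip(string html)
    {
        var sb = new StringBuilder(html.Length);
        int i = 0;
        while (i < html.Length)
        {
            if (html[i] != '<') { sb.Append(html[i]); i++; continue; }

            // Comments, <script> and <style> are skipped together with their contents.
            int next = SkipBlock(html, i, "<!--", "-->");
            if (next < 0) next = SkipBlock(html, i, "<script", "</script>");
            if (next < 0) next = SkipBlock(html, i, "<style", "</style>");
            if (next < 0)
            {
                // Any other tag: jump past its closing '>'.
                int close = html.IndexOf('>', i);
                next = close < 0 ? html.Length : close + 1;
            }
            sb.Append(' '); // every tag acts as a word separator (splits "a<b>b</b>")
            i = next;
        }
        // Decode entities only after stripping, so "&lt;b&gt;" survives as literal text.
        return CollapseWhitespace(WebUtility.HtmlDecode(sb.ToString()));
    }

    // If html has `open` at `start`, returns the index just past `close`; otherwise -1.
    private static int SkipBlock(string html, int start, string open, string close)
    {
        if (string.Compare(html, start, open, 0, open.Length, StringComparison.OrdinalIgnoreCase) != 0)
            return -1;
        int after = start + open.Length;
        // Reject longer tag names that merely share the prefix.
        if (after < html.Length && char.IsLetterOrDigit(html[after])) return -1;
        int end = html.IndexOf(close, after, StringComparison.OrdinalIgnoreCase);
        return end < 0 ? html.Length : end + close.Length;
    }

    // Replaces runs of whitespace with one space and trims both ends.
    private static string CollapseWhitespace(string s)
    {
        var sb = new StringBuilder(s.Length);
        bool pendingSpace = false;
        foreach (char c in s)
        {
            if (char.IsWhiteSpace(c)) { pendingSpace = sb.Length > 0; continue; }
            if (pendingSpace) { sb.Append(' '); pendingSpace = false; }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

A scanner like this only suits input that is as regular as the question describes: it assumes no `>` inside attribute values and expects closing tags written exactly as `</script>` and `</style>`.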
Equivalent in HtmlAgilityPack:
Here you can download a tool, with its source, that converts between HTML and XAML: XAML/HTML converter.
It contains an HTML parser (which must obviously be much more tolerant than a standard XML parser), and you can traverse the HTML much as you would XML.
From the command line, you can use the Lynx text browser like this:
You can disable the list of links with -nolist. For example:
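The Lynx commands themselves are missing from this page. Assuming they used Lynx's `-dump` option (which prints the rendered page as plain text) together with the `-nolist` option named in the answer, a batch job could drive Lynx from C# roughly as sketched here. The `LynxDump` helper is hypothetical, and `lynx` is assumed to be on the PATH.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper that shells out to the lynx executable.
public static class LynxDump
{
    // Builds the command line: -dump renders to plain text,
    // -nolist suppresses the trailing list of links.
    public static string BuildArguments(string pathOrUrl, bool listLinks)
    {
        string args = "-dump";
        if (!listLinks) args += " -nolist";
        return args + " \"" + pathOrUrl + "\"";
    }

    // Runs lynx and returns whatever it writes to standard output.
    public static string Run(string pathOrUrl, bool listLinks = false)
    {
        var psi = new ProcessStartInfo("lynx", BuildArguments(pathOrUrl, listLinks))
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(psi))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Calling `LynxDump.Run("page.html")` is then equivalent to running `lynx -dump -nolist "page.html"` in a shell and capturing the output.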
Try the following code: