Strip HTML from a web page and calculate word frequency?
In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter.
Finally, let me mention again that I'd like to do this in Groovy.
Comments (3)
You can use the Lynx Web Browser to spit out the document text and save it.
Do you want to do this automatically? Do you want a separate application that does this? Or do you want help coding it into your application? What platforms (Windows desktop, web server, etc.) will it run on?
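If the Lynx route is acceptable, it can be driven from Groovy by shelling out. This is only a minimal sketch: it assumes a `lynx` binary is installed and on the PATH, and the helper names `lynxCommand` and `dumpPage` are made up for illustration.

```groovy
// Build the Lynx command line.
// -dump   renders the page as plain text on stdout
// -nolist leaves out the numbered list of links Lynx normally appends
List<String> lynxCommand(String url) {
    ['lynx', '-dump', '-nolist', url]
}

// Run Lynx and return the rendered text (assumes lynx is on the PATH).
String dumpPage(String url) {
    Process proc = lynxCommand(url).execute()
    String text = proc.text   // read everything Lynx wrote to stdout
    proc.waitFor()
    text
}
```

Calling `dumpPage` with a page address should then give you the page as one plain-text string, which can be split into words with `tokenize()`.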
If you want a collection of tokenized words from HTML, can't you just parse it as XML (it needs to be valid XML) and grab all of the text between the tags? How about something like this:
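One way the parse-it-as-XML idea might look is sketched below. It is an illustration rather than a definitive implementation, and it rests on a few assumptions: the page is well-formed XHTML with no DOCTYPE and no HTML-only entities such as `&nbsp;`, Groovy 3 or later is in use (so `XmlParser` lives in `groovy.xml`), and the helper names `textChunks` and `wordsIn` are invented for the example.

```groovy
import groovy.xml.XmlParser

// Walk the tree in document order and gather the text chunks.
// With XmlParser, a node's children are either nested Nodes or Strings.
List<String> textChunks(Node node) {
    node.children().collectMany { child ->
        child instanceof Node ? textChunks(child) : [child.toString()]
    }
}

// Parse well-formed XHTML and return its words, lower-cased, in order.
List<String> wordsIn(String xhtml) {
    Node root = new XmlParser().parseText(xhtml)
    textChunks(root).collectMany { it.toLowerCase().findAll(/\w+/) }
}

// Inline sample so the sketch runs without network access.
def sample = '''<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello Groovy</h1>
    <p>Groovy makes <b>parsing</b> XML easy.</p>
  </body>
</html>'''

println wordsIn(sample)
```

For a live page, something like `address.toURL().text` (with `address` being the page URL as a string) could supply the input, but most real-world HTML is not valid XML and will make the parser throw.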
Assuming you want to do this with Groovy (I'm guessing from the groovy tag), your approach will likely be either heavily shell-script oriented or based on Java libraries. For shell scripting I agree with moogs: using Lynx or Elinks is probably the easiest way to go. Otherwise, have a look at HTMLParser and see Processing Every Word in a File (scroll down to find the relevant code snippet).
You'll probably have to find Java libraries to use with Groovy for the HTML parsing, as there don't appear to be any Groovy libraries for it. If you're not using Groovy, please post the desired language, since there are a multitude of HTML-to-text tools out there, depending on the language you're working in.
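Whichever tool ends up extracting the text, the counting step itself is short in plain Groovy. Here is a minimal sketch, assuming the words are already in a collection; `wordFrequencies` is a hypothetical helper name, not something from HTMLParser or the linked article.

```groovy
// Count how often each word occurs (case-insensitively), most frequent first.
Map<String, Integer> wordFrequencies(Iterable<String> words) {
    words.countBy { it.toLowerCase() }.sort { -it.value }
}

// tokenize() splits on whitespace and returns a List of words.
def sampleWords = 'The cat and the hat and the bat'.tokenize()
def freq = wordFrequencies(sampleWords)
freq.each { word, count -> println "$word: $count" }
```

`countBy` builds the word-to-count map in one pass, and sorting on the negated count puts the most common words at the front.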