HTML article content extraction - Alchemy API alternative
I've been doing a lot of research to figure out the best way to code an application to get the main article content from almost any HTML webpage. I have a C program that uses libxml2 to parse through the XML, but I came across Alchemy API, which appears to do what I want.
However, it only has an online API and I wanted to keep the application in-house without relying on any external calls.
So does anybody have tips? I was hoping for an off-line alternative that does what Alchemy API can do (paid/non-paid).
My alternative may be to just parse the HTML and use NLP (Natural Language Processing) techniques and other methods to get at the main article content. The types of websites it will be used on include sites with a news section or a blog.
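To make the parse-and-score alternative concrete, here is a minimal sketch in Python that uses only the standard library's `html.parser`. It credits paragraph text to the nearest enclosing container element, ignores text inside navigation-style elements, discounts link text, and returns the paragraphs of the highest-scoring container. The tag sets, the scoring rule, and the names `ArticleScorer`, `extract_main_text`, and `SAMPLE` are assumptions made for illustration. They are not taken from Alchemy API or any other tool, and real pages would likely need a more tolerant parser and richer features.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; tune them for the sites you target.
CONTAINERS = {"article", "div", "main", "section", "td"}
SKIPPED = {"aside", "footer", "form", "header", "nav", "script", "style"}


def new_block():
    # One candidate container: its paragraphs and how much of the text is link text.
    return {"paragraphs": [], "link_chars": 0}


class ArticleScorer(HTMLParser):
    """Collects <p> text per enclosing container so containers can be ranked."""

    def __init__(self):
        super().__init__()
        root = new_block()           # catches paragraphs outside any container
        self.open_blocks = [root]    # stack of currently open containers
        self.all_blocks = [root]     # every container seen so far
        self.skip_depth = 0          # > 0 while inside a SKIPPED element
        self.in_paragraph = False
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            block = new_block()
            self.open_blocks.append(block)
            self.all_blocks.append(block)
            self.in_paragraph = False
        elif tag == "p":
            self.in_paragraph = True
            self.open_blocks[-1]["paragraphs"].append("")
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS:
            if len(self.open_blocks) > 1:   # never pop the root block
                self.open_blocks.pop()
            self.in_paragraph = False
        elif tag == "p":
            self.in_paragraph = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.skip_depth or not self.in_paragraph:
            return
        block = self.open_blocks[-1]
        block["paragraphs"][-1] += data
        if self.in_link:
            block["link_chars"] += len(data)


def clean_paragraphs(block):
    # Collapse whitespace and drop empty paragraphs.
    return [" ".join(p.split()) for p in block["paragraphs"] if p.strip()]


def score(block):
    # Paragraph characters minus link characters: link-heavy blocks score low.
    return sum(len(p) for p in clean_paragraphs(block)) - block["link_chars"]


def extract_main_text(html):
    scorer = ArticleScorer()
    scorer.feed(html)
    scorer.close()
    best = max(scorer.all_blocks, key=score)
    return "\n\n".join(clean_paragraphs(best))


SAMPLE = """
<html><body>
<nav><p>Home | News | Blog</p></nav>
<div id="sidebar"><p><a href="/x">Related story</a></p></div>
<div id="story">
<h1>Headline</h1>
<p>First paragraph of the article, with enough text to outweigh the sidebar.</p>
<p>Second paragraph, including a <a href="/ref">reference link</a> inline.</p>
</div>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

The same scoring idea should carry over to a tree built by an HTML parser in C, by walking the element nodes instead of handling parser events. NLP features such as sentence length or stop-word counts could be added to `score` without changing the rest of the structure.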
Comments (2)
There are a few open source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose was open sourced by Gravity.com.
It has information on the wiki, as well as source code you can view. There are dozens of unit tests that show the text extracted from various articles.
AlchemyAPI also offers an on-premise solution, so you don't have to access it online. Our customers generally choose the on-premise solution when they have special security or latency requirements. More information on on-premise solutions can be found here: http://www.alchemyapi.com/products/on-premise/