How to determine whether a document is an article?

Posted on 2024-10-30 05:08:25 · 258 words · 5 views · 0 comments

Say I have X documents. What algorithm/library/Tika config/NekoHTML filter would tell me which of those is an "article" and which is not, and for those that are, give me the article text (i.e. without other surrounding text)?

By an article I mean a chunk of structured text comprising at least one paragraph, and I think most human readers can filter those out.

The easiest way I thought of is ensuring that doclength > Y, where Y would be 350 words for example.
But that's not the most reliable of ways, since there could be very long lists for example, and it doesn't give me the article text.
Simply looking for tags is not good enough either.
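To make the length heuristic above concrete, here is a minimal sketch in Python. It assumes the text has already been extracted with one block per line. On top of the `doclength > Y` check it also requires that most of the words sit in paragraph-like lines, which is one way to reject very long lists. The names `is_paragraph` and `looks_like_article` and all thresholds other than the 350 words are hypothetical, and this only classifies a document; it does not isolate the article text.

```python
def is_paragraph(block, min_block_words=20):
    """Treat a line as a paragraph if it is reasonably long and ends like a sentence.

    min_block_words is an assumed threshold, not something from the question.
    """
    words = block.split()
    return len(words) >= min_block_words and block.rstrip().endswith((".", "!", "?", '"'))


def looks_like_article(text, min_words=350, min_prose_ratio=0.5):
    """Return True if text passes the word-count check and is mostly prose.

    min_words is the "Y" from the question; min_prose_ratio is an assumption.
    """
    blocks = [line.strip() for line in text.splitlines() if line.strip()]
    total_words = sum(len(b.split()) for b in blocks)
    if total_words < min_words:
        return False
    # Count only the words that live in paragraph-like lines.
    prose_words = sum(len(b.split()) for b in blocks if is_paragraph(b))
    return prose_words / total_words >= min_prose_ratio


if __name__ == "__main__":
    paragraph = " ".join(["word"] * 39) + " end."
    print(looks_like_article("\n".join([paragraph] * 10)))
    print(looks_like_article("\n".join("item %d" % i for i in range(300))))
```

A long list of short items passes the word count but fails the prose ratio, so it is rejected. A list whose items are themselves full sentences would still slip through.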

Comments (2)

赠佳期 2024-11-06 05:08:25

You can use Boilerpipe to extract the text from the page and then decide for yourself whether it's an article based on your heuristics, e.g. article length. I'm afraid, though, that your solution would not work anyway. A list of disconnected items still looks like a list of sentences. You'd need to "understand" the content.
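To illustrate the "extract first, then apply your own heuristics" idea, here is a rough standard-library Python sketch. It is a simplified stand-in and not Boilerpipe's actual algorithm or API: it splits the HTML into text blocks and keeps only blocks that have enough words and a low share of link text. The names `BlockCollector` and `extract_main_text`, the tag sets, and both thresholds are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br", "pre",
              "blockquote", "article", "section",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect (text, link_word_count) pairs, one per block-level chunk."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        if self._words:
            self.blocks.append((" ".join(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_words in parser.blocks:
        n = len(text.split())
        if n >= min_words and link_words / n <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

The returned text can then be passed to whatever length or structure check you settle on. As noted above, this still cannot tell a list of sentence-like items from real prose.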

忘你却要生生世世 2024-11-06 05:08:25

Depending on technical factors like throughput, latency, and connectivity, and non-technical factors such as money, if it's something that humans can easily do but computers find hard, you might want to consider using Amazon Mechanical Turk to define HITs that tell an article from other kinds of text. There's an API to integrate HIT results with your code.
