Getting the text that will be displayed to the user from HTML

Posted on 2024-09-05 20:12:47

A bit of a random one: I want to have a play with some NLP stuff, and I would like to:

Get all the text that will be displayed to the user in a browser from HTML.

My ideal output would contain no tags, only full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure here (random other stuff ending up in the output).

If there were a way of inserting a newline or full stop where the content is unlikely to continue, that would be an added bonus. For example:

Items in a ul or option tag could be separated by full stops (or, to be honest, just ignored).

I am working in Java, but would be interested in seeing any code that does this.

I can (and will, if required) come up with something to do this; I just wondered whether anything like this already exists, as it would probably be better than what I could come up with in an afternoon ;-).

An example of what I might write, if I do end up doing this, would be to use a SAX parser to find the content in p tags, strip any span, strong, etc. tags from it, and add a full stop whenever I hit a div or another p without having seen a full stop.
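A minimal sketch of that SAX idea, using only the JDK. It assumes the input is already well-formed XML (real-world HTML usually is not, so it would need cleaning first). The class name `SentenceTextHandler`, the `extract` helper, and the list of elements treated as sentence boundaries are all illustrative choices, not part of any library.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text and ends each block with a full stop.
final class SentenceTextHandler extends DefaultHandler {
    // Elements treated as sentence boundaries (an assumption, extend as needed).
    private static final Set<String> BLOCKS =
            Set.of("p", "div", "li", "option", "h1", "h2", "h3");

    private final StringBuilder out = new StringBuilder();
    private final StringBuilder current = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Inline tags such as span or strong are simply never emitted,
        // so their text ends up merged into the surrounding block.
        current.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (BLOCKS.contains(qName.toLowerCase())) {
            flush();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (BLOCKS.contains(qName.toLowerCase())) {
            flush();
        }
    }

    // Emits the pending text, adding a full stop if it has no end punctuation.
    private void flush() {
        String text = current.toString().replaceAll("\\s+", " ").trim();
        current.setLength(0);
        if (text.isEmpty()) {
            return;
        }
        out.append(text);
        char last = text.charAt(text.length() - 1);
        if (last != '.' && last != '!' && last != '?') {
            out.append('.');
        }
        out.append('\n');
    }

    static String extract(String xhtml) {
        SentenceTextHandler handler = new SentenceTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        handler.flush(); // any text after the last block element
        return handler.out.toString();
    }
}
```

Calling `SentenceTextHandler.extract("<div><p>First <strong>bold</strong> sentence</p><p>Second one.</p></div>")` yields two lines, `First bold sentence.` and `Second one.`, with the full stop added only where it was missing.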

Any pointers or suggestions very welcome.

Comments (3)

失而复得 2024-09-12 20:12:48

HTML parsers seem to be a reasonable starting point for this.

There are a number of them; for example, HTMLCleaner and NekoHTML seem to work fine.

They are good because they fix up the tags so that you can process them more consistently, even if you are just removing them.

But as it turns out, you probably want to get rid of script tags, metadata, etc. In that case you are better off working with the well-formed XML that these tools produce for you from "wild" HTML.
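Once the markup is well-formed XML, dropping the non-visible parts can be done on a DOM tree. The following is a sketch under that assumption, using only the JDK's XML APIs; it does not call HTMLCleaner or NekoHTML, whose output it merely presumes as input. The class name `VisibleTextDom` and the list of hidden elements are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes script, style and head subtrees, then reads the text.
final class VisibleTextDom {
    // Elements whose content a browser does not render as page text (assumed list).
    private static final String[] HIDDEN = {"script", "style", "head"};

    static String visibleText(String xhtml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        for (String name : HIDDEN) {
            NodeList nodes = doc.getElementsByTagName(name);
            // The list is live, so keep removing the first match until none are left.
            while (nodes.getLength() > 0) {
                Node node = nodes.item(0);
                node.getParentNode().removeChild(node);
            }
        }
        // Concatenate the remaining text nodes and collapse runs of whitespace.
        return doc.getDocumentElement().getTextContent().replaceAll("\\s+", " ").trim();
    }
}
```

This version does not insert any separators between block elements; it only shows the "remove what is never displayed" step.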

There are many SO questions relating to this (like this one); you should search for "HTML parsing", though ;-)

递刀给你 2024-09-12 20:12:47

Hmmm ... almost any HTML parser could be used to create the effect you want: just run through all of the tags, emit only the text elements, and emit an LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straightforward.
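One way to try this without extra dependencies is the lenient HTML parser that ships with Swing in the JDK. The sketch below is an illustration rather than a recommendation of that parser: it emits text, appends a line feed on the closing tag of each block element, and skips script, style and title content. The class name `BlockTextExtractor` and the set of skipped tags are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: text only, with a newline after every block element.
final class BlockTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();
    private int hiddenDepth = 0; // > 0 while inside script, style or title

    private static boolean isHidden(HTML.Tag tag) {
        return tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE || tag == HTML.Tag.TITLE;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (isHidden(tag)) {
            hiddenDepth++;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (isHidden(tag)) {
            if (hiddenDepth > 0) {
                hiddenDepth--;
            }
        } else if (tag.isBlock()) {
            newline();
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BR) {
            newline();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (hiddenDepth == 0) {
            out.append(data);
        }
    }

    // Appends a line feed unless the output is empty or already ends with one.
    private void newline() {
        int n = out.length();
        if (n > 0 && out.charAt(n - 1) != '\n') {
            out.append('\n');
        }
    }

    static String extract(String html) {
        BlockTextExtractor callback = new BlockTextExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }
}
```

`BlockTextExtractor.extract(html)` returns the collected text as a single string. Exact whitespace handling depends on the parser, so the output is best treated as raw material for a later clean-up pass.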

獨角戲 2024-09-12 20:12:47

I would just strip out everything that is a <> tag, and if you want a full stop at the end of every sentence, check for closing tags and place a full stop there.

If you have

<strong> test </strong>

(and other tags that change the look of the test), you could add conditions so that no full stop is placed here.
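A rough sketch of this strip-and-punctuate approach with regular expressions. It is deliberately naive: it does not handle comments, script bodies, or a `>` inside an attribute value, so it fits the "tolerate some failure" requirement rather than general HTML. The class name `TagStripper` and the list of inline tags are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips tags, adding a full stop at non-inline closing tags.
final class TagStripper {
    // Tags that only change appearance; no full stop is placed for these (assumed list).
    private static final Pattern INLINE =
            Pattern.compile("(?i)</?(strong|b|em|i|span|a|u)\\b[^>]*>");
    private static final Pattern CLOSING = Pattern.compile("</[a-zA-Z][^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        String s = html.replaceAll("\\s+", " ");     // source line breaks carry no meaning
        s = INLINE.matcher(s).replaceAll("");        // drop inline tags, keep their text
        s = CLOSING.matcher(s).replaceAll("\n");     // remaining closing tags end a sentence
        s = ANY_TAG.matcher(s).replaceAll(" ");      // drop every other tag
        StringBuilder out = new StringBuilder();
        for (String line : s.split("\n")) {
            String text = line.replaceAll("\\s+", " ").trim();
            if (text.isEmpty()) {
                continue;                            // e.g. consecutive closing tags
            }
            out.append(text);
            char last = text.charAt(text.length() - 1);
            if (last != '.' && last != '!' && last != '?') {
                out.append('.');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

For `<div><p>This is a <strong> test </strong> of stripping</p><p>Already ended.</p></div>` this gives `This is a test of stripping.` and `Already ended.` on separate lines: no full stop is placed after `</strong>`, and none is added where one already exists.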
