Parsing out all HTML tags/non-text; Java
What is the best way to take the HTML from a web page, strip all the HTML tags, JavaScript code, and anything else that isn't text to be displayed, and return the result with a separator between each piece of text that was wrapped in a different HTML tag?
First I tried using jsoup:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
String html = doc.body().text();
This is good for taking out all the non-text, but the result has no division between the pieces of text.
I'm currently trying to use regex like:
html.replaceAll("\\<.*?\\>", "")
But I'm really not familiar with regex, and I have problems taking out the JavaScript. This method, however, does leave newlines that I can use to track down separate text groups from different tag wrappings.
I was just wondering if there was some easy way of doing this before I try more regex to get it to work.
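For reference, here is a minimal sketch of where the regex route leads if you remove the script blocks before stripping tags. The class name RegexTextSplitter and the method split are made up for illustration. It assumes script, style, and comment blocks are closed properly, it does not decode entities such as &amp;, and it will break on markup like a ">" inside an attribute value, so treat it as a rough approximation rather than a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class RegexTextSplitter {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, spanning lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // HTML comments.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    // Returns the visible text pieces, one entry per run of text between tags.
    static List<String> split(String html) {
        // Remove script/style first so their contents never reach the tag pass.
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll("\n");
        s = COMMENT.matcher(s).replaceAll("\n");
        // Each tag becomes a newline, which then acts as the separator.
        s = TAG.matcher(s).replaceAll("\n");

        List<String> pieces = new ArrayList<>();
        for (String line : s.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                pieces.add(trimmed);
            }
        }
        return pieces;
    }
}
```

Calling String.join(" | ", RegexTextSplitter.split(html)) then gives one string with a separator of your choice between the pieces.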
Thanks
Comments (3)
It looks like jsoup doesn't provide an immediately obvious way to do that, so I made a quick hack by editing the source code and adding the method text_mod() to Element. There are limitations to this approach, but if you find it useful, you can download the modified jar at http://ge.tt/9PAMpzA. Here's the addition:
For example, try this:
I get
Regexes will generally not work for arbitrary HTML, since regular expressions can't fully parse HTML (the technical reason involves the pumping lemma, which isn't important for the task at hand).
I would recommend starting with an XML parser (assuming your HTML doesn't do anything too weird) and walking down the parse tree for data that sits in displayable tags. XPath expressions would be pretty helpful here.
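As a rough sketch of that idea using only the JDK's built-in XML and XPath APIs: the class name XhtmlTextExtractor and the method textChunks are invented for this example. It assumes the page is well-formed XML (XHTML-style) with no DOCTYPE or namespace declaration, which most real pages will not satisfy without cleanup first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class XhtmlTextExtractor {
    // Returns every non-blank text node outside <script>/<style>, in document order.
    static List<String> textChunks(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Throws if the input is not well-formed XML.
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select text nodes that have no script or style ancestor.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//text()[not(ancestor::script) and not(ancestor::style)]",
                doc, XPathConstants.NODESET);

        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getNodeValue().trim();
            if (!text.isEmpty()) {
                chunks.add(text);
            }
        }
        return chunks;
    }
}
```

Each text node becomes its own list entry, so String.join(" | ", XhtmlTextExtractor.textChunks(page)) gives the separated output the question asks for. Text split by inline tags such as b or i comes back as several entries, which may or may not be what you want.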
In JavaScript with the DOM, you can get the text of any HTML element with the textContent or innerText property of the DOM element. If you do this for the BODY element, you get a "text" version of the page.