Extracting the text between two links in HTML with Java

Posted on 2024-11-02 00:01:31

I am trying to retrieve the text data from an ePub file using Java. The text of the ePub file lies within an HTML file that is formatted roughly like this:

<h2 id="pgepubid00001">Chapter I</h2>

<p>Some text</p>
<p>Another line of Text</p>

<br/>

<h2 id="pgepubid00002">Chapter II</h2>

etc..

Before opening this file I already know the id of the chapter I need to extract, and I can find the id of the next chapter too. Because of this, I thought a logical approach would be to run the file through a SAX parser and extract the text of each paragraph until I reach the link for the next chapter. But this is proving to be quite a task.

Of course, everything is dynamic, so there is no fixed link to go to. The HTML is semi-strictly formatted, so I didn't expect parsing to be this much of a problem. Can anyone recommend a good way to extract the text I need?

The solution needs to be JAVA ONLY; no other languages can be used. I am looking to implement this on an Android device.

不喜欢何必死缠烂打 2024-11-09 00:01:31

Well, you know the ids of the chapters, so why not use String.indexOf?

int start = text.indexOf("<h2 id=\"pgepubid00001\">");
int end = text.indexOf("<h2 id=\"pgepubid00002\">");

// substring takes (beginIndex, endIndex), not (beginIndex, length)
String whatYoureLookingFor = text.substring(start, end);

Keep it simple.
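
For anyone who wants something a bit more defensive, here is a minimal sketch of the same indexOf idea wrapped in a helper. The class name ChapterExtractor and the method extractChapter are made up for illustration, and it assumes the whole HTML file has already been read into a String and that each chapter heading starts with exactly `<h2 id="...`. It also strips tags with a crude regex, which is probably fine for markup as simple as the sample but is not a general HTML parser, and it does not decode entities such as `&amp;`.

```java
final class ChapterExtractor {

    private ChapterExtractor() {
        // static helper only (hypothetical class, not part of any library)
    }

    /**
     * Returns the plain text from the heading with startId up to (but not
     * including) the heading with endId. The chapter title itself is included.
     * Returns null if the start heading cannot be found. If endId is null or
     * not found, everything up to the end of the input is returned.
     */
    static String extractChapter(String html, String startId, String endId) {
        int start = html.indexOf("<h2 id=\"" + startId + "\"");
        if (start < 0) {
            return null;
        }

        // Look for the next heading only after the start heading.
        int end = -1;
        if (endId != null) {
            end = html.indexOf("<h2 id=\"" + endId + "\"", start + 1);
        }
        if (end < 0) {
            end = html.length(); // last chapter: read to the end
        }

        // substring takes (beginIndex, endIndex)
        String chunk = html.substring(start, end);

        // Crude tag removal: replace every <...> with a space,
        // then collapse runs of whitespace into single spaces.
        String text = chunk.replaceAll("<[^>]+>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Calling ChapterExtractor.extractChapter(html, "pgepubid00001", "pgepubid00002") on the sample from the question (with the second heading's id being pgepubid00002) should give the chapter title followed by the paragraph text on one line. If you need to keep paragraph breaks, you could replace `</p>` with a newline before stripping the remaining tags.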
