Extracting text from HTML in Java
I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.
I want to extract the text between the paragraph tags, but I can only get one line of each paragraph. My code is as follows:
    FileReader fileReader = new FileReader(file);
    BufferedReader buffRd = new BufferedReader(fileReader);
    BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
    String s;
    // Only lines that contain "<p>" are written, so the rest of a multi-line paragraph is lost
    while ((s = buffRd.readLine()) != null) {
        if (s.contains("<p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
I tried adding another while loop that would tell the program to keep writing to the file until the line contains the </p> tag:
    while ((s = buffRd.readLine()) != null) {
        if (s.contains("<p>")) {
            while (!s.contains("</p>")) {
                try {
                    out.write(s);
                } catch (IOException e) {
                }
            }
        }
    }
But this doesn't work. Could someone please help?
jsoup
Another HTML parser I really liked using is jsoup. You can get all the <p> elements in two lines of code, then write them out to a file in one more line. If you want them on separate lines, you can iterate through the elements and write each one out separately.
Jericho is one of several possible HTML parsers that could make this task both easy and safe.
JTidy can represent an HTML document (even a malformed one) as a document model, making the process of extracting the contents of a <p> tag rather more elegant than manually working through the raw text.
Try (if you don't want to use an HTML parser library):
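For illustration only, here is a minimal sketch of one library-free approach. It assumes the whole page fits in memory and that <p> elements are not nested. The names `ParagraphExtractor` and `extractParagraphs` and the command-line arguments are made up for this example and are not taken from the answer. Instead of testing one line at a time, it reads the whole file and uses a regular expression with DOTALL, so a paragraph can span several lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
class ParagraphExtractor {
    // DOTALL lets '.' match newlines, so a paragraph may span several lines.
    // The reluctant '.*?' stops at the first closing tag after each opening tag.
    // '\b' keeps "<p" from matching longer tag names such as <pre>.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed inner text of every <p>...</p> pair, in document order.
    static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            paragraphs.add(m.group(1).trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ParagraphExtractor <input.html> <output.txt>");
            return;
        }
        // Files.readString requires Java 11 or later.
        String html = Files.readString(Path.of(args[0]), StandardCharsets.UTF_8);
        // Writes one paragraph per line group; try-with-resources is not needed here
        // because Files.write opens and closes the file itself.
        Files.write(Path.of(args[1]), extractParagraphs(html), StandardCharsets.UTF_8);
    }
}
```

Regular expressions are brittle on real-world HTML (unclosed tags, nested markup, comments), so this is a stopgap rather than a replacement for a real parser.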
I've had success using TagSoup & XPath to parse HTML.
http://home.ccil.org/~cowan/XML/tagsoup/
Use a ParserCallback. It's a simple class that's included with the JDK. It notifies you every time a new tag is found, and then you can extract the text of the tag. Simple example:
So all you need to do is set a boolean flag when the paragraph tag is found. Then, in the handleText() method, you extract the text.
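The flag-based approach described above could look roughly like the sketch below. It uses the JDK's `HTMLEditorKit.ParserCallback` and `ParserDelegator` from `javax.swing.text.html`. The class name `ParagraphCallback` and the `extract` helper are assumptions made for this illustration. Skipping start tags marked `IMPLIED` is also a choice made here, so that paragraphs the parser synthesizes around loose text are not collected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback; collects the text found inside explicit <p> elements.
class ParagraphCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false; // the boolean flag from the answer

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Ignore <p> tags the parser inserted itself (marked with IMPLIED).
        boolean implied = attrs.containsAttribute(IMPLIED, Boolean.TRUE);
        if (tag == HTML.Tag.P && !implied) {
            inParagraph = true;
            current.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text inside nested inline tags (<b>, <a>, ...) also arrives here.
        if (inParagraph) {
            current.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && inParagraph) {
            inParagraph = false;
            paragraphs.add(current.toString().trim());
        }
    }

    // Parses an HTML string and returns the text of each explicit paragraph.
    static List<String> extract(String html) {
        ParagraphCallback callback = new ParagraphCallback();
        try {
            // 'true' tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.paragraphs;
    }
}
```

To read from a downloaded file instead, the `StringReader` can be swapped for a `FileReader` or any other `Reader`, and the returned list written out with a `BufferedWriter`.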
Try this.
You may just be using the wrong tool for the job: