Jsoup: begin parsing after a specified tag, or start at the bottom of the page?
I have a block of HTML that I am parsing with Jsoup. However, not all of it is relevant, and parsing the irrelevant parts throws off my data set.
On the site, there is a header that can change at any time. The header contains links that I don't care about. When Jsoup parses the document, it adds those links to my link array and throws off my values.
The HTML I am interested in comes after the <!-- BEGIN TOPICS --> comment.
I would like to be able to tell Jsoup to ignore everything above that comment. Is this possible? If not, I can work around this issue by beginning my parsing at the bottom of the document, but I'm not sure how I would go about that either.
My Jsoup query is as follows. Please ignore the commented-out lines and debugging statements; I've been trying to work this out for a while and still have the test code in.
Thread getTitlesThread = new Thread() {
    public void run() {
        TitleResults titleArray = new TitleResults();
        StringBuilder whole = new StringBuilder();
        try {
            URL url = new URL(Constants.FORUM);
            HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
            urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(new BufferedInputStream(urlConnection.getInputStream())));
                String inputLine;
                while ((inputLine = in.readLine()) != null)
                    whole.append(inputLine);
                in.close();
            } catch (IOException e) {
            } finally {
                urlConnection.disconnect();
            }
        } catch (Exception e) {}

        Document doc = Parser.parse(whole.toString(), Constants.FORUM);
        Elements threads = doc.select("TOPICS > .topic_title");
        Elements authors = doc.select("a[hovercard-ref]");
        // for (Element author : authors) {
        //     authorArray.add(author.text());
        // }
        // cleanAuthors();
        if (threads.isEmpty()) {
            Log.d("POC", "EMPTY BRO!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11");
        }
        // for (Element thread : threads) {
        //     titleArray = new TitleResults();
        //     Log.d("POC", thread.toString());
        //
        //     titleArray.setAuthorDate(authorArray.get(0));
        //     authorArray.remove(0);
        //     // Thread title
        //     threadTitle = thread.text();
        //     titleArray.setItemName(threadTitle);
        //
        //     // Thread link
        //     String threadStr = thread.attr("abs:href");
        //     String endTag = "/page__view__getnewpost"; // trim link
        //     threadStr = new String(threadStr.replace(endTag, ""));
        //     threadArray.add(threadStr);
        //     results.add(titleArray);
        // }
    }
};
getTitlesThread.start();
Comments (2)
This ought to work, given your description (hard to be certain without the actual HTML input):
Remove the part of the document that you don't want to parse. In my case, <!-- end ad tag --> was the beginning of what I wanted to ignore and <!-- BEGIN TOPICS --> was the end.
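The removal step described in this answer can be sketched with plain string operations applied to the downloaded HTML before it reaches Jsoup. This is only an illustration under assumptions, not the code from the original answer: the class name HtmlTrimmer, the methods dropBefore and cutBetween, and the sample page string are all hypothetical. It assumes each marker comment appears verbatim in the page source.

```java
// Hypothetical helper (not part of Jsoup): trims raw HTML text before parsing.
final class HtmlTrimmer {

    // Returns the text that follows the first occurrence of marker,
    // or the whole input unchanged when the marker is absent.
    static String dropBefore(String html, String marker) {
        int at = html.indexOf(marker);
        return at < 0 ? html : html.substring(at + marker.length());
    }

    // Removes everything from the first start marker (inclusive) up to the
    // next end marker (exclusive). Returns the input unchanged if either
    // marker is missing.
    static String cutBetween(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return html;
        }
        int to = html.indexOf(end, from + start.length());
        if (to < 0) {
            return html;
        }
        return html.substring(0, from) + html.substring(to);
    }

    public static void main(String[] args) {
        // Made-up page fragment standing in for the forum HTML.
        String page = "<a href=\"/ad\">ad</a><!-- end ad tag --><a href=\"/nav\">nav</a>"
                + "<!-- BEGIN TOPICS --><a class=\"topic_title\" href=\"/t/1\">First</a>";

        // Drop the header links that sit between the two comments.
        System.out.println(cutBetween(page, "<!-- end ad tag -->", "<!-- BEGIN TOPICS -->"));

        // Or keep only what follows the BEGIN TOPICS comment.
        System.out.println(dropBefore(page, "<!-- BEGIN TOPICS -->"));
    }
}
```

In the question's code, the trimmed string would take the place of whole.toString() in the parse call, so that selectors such as a[hovercard-ref] only see the markup after the header. Both methods return the input untouched when a marker is missing, so a change in the site's comments shows up as unfiltered results rather than an exception.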