Help scraping HTML with JSoup
Little bit of a beginner here, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I am having trouble with the initial step of scraping the data from the site.
I just added the JSoup library to my project in Eclipse, and am now having trouble initializing the connection when following the Jsoup documentation.
In the end, my goal is to grab each class name / time / description, but for now I want to just grab the name. The HTML of the source website appears like this:
<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')
My first guess was to call getElementsByTag("td"), and then query these elements for the parameter of onclick= or the value of the 'class' attribute, cleaning it up by removing the initial "I" and the suffix " SW", leaving behind the name "CS3330".
Now onto the actual implementation:
Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");
At this point, I am already running into problems (even though I am not straying far from the examples provided in the documentation) and would appreciate some guidance on getting my code to function!
edit: GOT IT! Thank you all!
Comments (2)
According to the documentation, you should be using the connect(...) method here.
The parse() method is for files.
I just downloaded JSoup and tried it out on your school's website and got this output:
Too flippin' cool! Vlad is right though; use the connect(...) method. +1 to Vlad.
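For reference, the connect(...) approach mentioned above could look roughly like the sketch below. The URL is the one from the question; the class name FetchCourses and the helper cells are hypothetical names used only for illustration, and running main assumes the jsoup jar is on the classpath and the site is reachable.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical class name, for illustration only.
public class FetchCourses {
    // URL copied from the question.
    static final String COURSE_URL =
        "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci";

    // Hypothetical helper: returns every <td> element in a parsed document.
    public static Elements cells(Document doc) {
        return doc.getElementsByTag("td");
    }

    public static void main(String[] args) throws IOException {
        // connect(...) opens an HTTP connection; get() fetches and parses the page.
        Document doc = Jsoup.connect(COURSE_URL).get();
        Elements td = cells(doc);
        System.out.println(td.size() + " td elements found");
    }
}
```

Jsoup.parse(String) is still handy when the HTML is already in memory, which is how the helper can be tried out without touching the network.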
Other suggestions and hints:
These are the constants that I used in my little program:
And these are the variables I used inside the scraping method:
Edit 1
Based on your recent comments, I think that you're over-thinking things a bit. What worked well for me is this simple algorithm:
td.get(i);
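The cleanup step described in the question (dropping the leading "I" and the trailing " SW" from the img element's class value) might be sketched as below. This is only an illustration, not the code from this answer: the class CourseNameCleaner and the method cleanName are hypothetical names, and the input is assumed to be the raw class string, for example what jsoup's Element.attr("class") returns for the img element.

```java
// Hypothetical helper, for illustration only.
public class CourseNameCleaner {
    // Turns a class value such as "ICS3330 SW" into "CS3330".
    public static String cleanName(String classAttr) {
        String name = classAttr.trim();
        // Drop the trailing " SW" marker if present.
        if (name.endsWith(" SW")) {
            name = name.substring(0, name.length() - " SW".length());
        }
        // Drop the leading "I" if present.
        if (name.startsWith("I")) {
            name = name.substring(1);
        }
        return name;
    }

    public static void main(String[] args) {
        // Sample value taken from the HTML shown in the question.
        System.out.println(cleanName("ICS3330 SW"));
    }
}
```

Keeping the string cleanup separate from the scraping code makes it easy to check against a few sample values before running it over the whole page.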