Help scraping HTML with JSoup

Posted 2024-11-28 11:51:14

I'm a bit of a beginner, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I'm having trouble with the initial step of scraping the data from the site.

I just added the JSoup library to my project in Eclipse, and I'm now having trouble initializing the connection while following the Jsoup documentation.

In the end, my goal is to grab each class name / time / description, but for now I want to just grab the name. The HTML of the source website appears like this:

<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')

My first guess was to call getElementsByTag("td"), then query those elements for either the argument of onclick= or the value of the 'class' attribute on the img, and clean that up by removing the initial "I" and the suffix " SW", leaving behind the name "CS3330".
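A minimal sketch of just the clean-up step described above, using only the standard library. It assumes the img class value always has the shape "I" + course code + " SW", which is only what the one snippet above shows; the class name CourseNameCleaner and the method courseCode are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CourseNameCleaner {
    // Assumed shape of the img class value: "I" + course code + " SW", e.g. "ICS3330 SW".
    private static final Pattern IMG_CLASS = Pattern.compile("I([A-Z]+\\d+) SW");

    // Returns the course code if the value has the assumed shape, otherwise empty.
    static Optional<String> courseCode(String imgClass) {
        Matcher m = IMG_CLASS.matcher(imgClass.trim());
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints CS3330
        System.out.println(courseCode("ICS3330 SW").orElse("(no match)"));
    }
}
```

Returning an Optional rather than blindly trimming characters means a cell that does not follow the assumed pattern is skipped instead of producing a mangled name.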

Now onto the actual implementation:

Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");

At this point, I am already running into problems (even though I am not straying far from the examples provided in the documentation) and would appreciate some guidance on getting my code to function!

edit: GOT IT! Thank you all!

Comments (2)

谷夏 2024-12-05 11:51:15

According to the documentation, you should be doing:

Document doc = Jsoup.connect(url).get();

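The parse() overloads are for HTML you already have, such as a file or a string; they do not fetch a URL given as a plain String argument.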

过期情话 2024-12-05 11:51:15

I just downloaded JSoup and tried it out on your school's website and got this output:

Unit: Computer Science
   CS 1010: Introduction to Information Technology
   CS 1110: Introduction to Programming
   CS 1111: Introduction to Programming
   CS 1112: Introduction to Programming
   CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
   CS 2102: Discrete Mathematics I
   CS 2110: Software Development Methods
   CS 2150: Program and Data Representation
   CS 2220: Engineering Software
   CS 2330: Digital Logic Design
   CS 2501: Special Topics in Computer Science
   CS 3102: Theory of Computation
   CS 3330: Computer Architecture
   CS 4102: Algorithms
   CS 4240: Principles of Software Design
   CS 4414: Operating Systems
   CS 4444: Introduction to Parallel Computing
   CS 4457: Computer Networks
   CS 4501: Special Topics in Computer Science
   CS 4753: Electronic Commerce Technologies
   CS 4810: Introduction to Computer Graphics
   CS 4993: Independent Study
   CS 4998: Distinguished BA Majors Research
   CS 6161: Design and Analysis of Algorithms
   CS 6190: Computer Science Perspectives
   CS 6354: Computer Architecture
   CS 6444: Introduction to Parallel Computing
   CS 6501: Special Topics in Computer Science
   CS 6610: Programming Languages
   CS 7457: Computer Networks
   CS 7993: Independent Study
   CS 7995: Supervised Project Research
   CS 8501: Special Topics in Computer Science
   CS 8524: Topics in Software Engineering
   CS 8897: Graduate Teaching Instruction
   CS 8999: Thesis
   CS 9999: Dissertation

Too flippin' cool! Vlad is right, though: use the connect(...) method. +1 to Vlad.

Other suggestions and hints:
These are the constants that I used in my little program:

   private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
        "page.php?Semester=1118&Type=Group&Group=CompSci";
   private static final String TD_TAG = "td";
   private static final String CLASS_ATTRIB = "class";
   private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
   private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
   private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

And these are the variables I used inside the scraping method:

     String unitName = "";
     List<String> courseNumbNameList = new ArrayList<String>();
     String courseNumbName = "";

Edit 1
Based on your recent comments, I think that you're over-thinking things a bit. What worked well for me is this simple algorithm:

  • Create the 3 variables I have listed above
  • Get my document as Vlad recommends.
  • Create a td Elements variable and assign to it all elements that have a td tag.
  • Use a for loop with int i going from 0 to < td.size(), and get each Element inside the loop with Element element = td.get(i);
  • Inside the loop check the element's class attribute.
  • If the attribute String equals the CLASS_ATTRIB_UNIT_NAME String (see above), get the element's text and use it to set the unitName variable.
  • If the attribute String equals CLASS_ATTRIB_COURSE_NUM, set courseNumbName to the element's text.
  • If the attribute String equals CLASS_ATTRIB_COURSE_NAME, append the element's text to the courseNumbName String, add the String to the array list, and reset courseNumbName to "".
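
The steps above can be sketched roughly as follows. To keep it runnable without the library or the network, this version works on a hypothetical Cell stand-in (class attribute plus text) rather than on real Jsoup elements; with Jsoup, each pair would presumably come from element.attr("class") and element.text(). The names CourseListBuilder, Cell, Result and collect, and the ": " separator between number and name, are my assumptions, not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CourseListBuilder {
    static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
    static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
    static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

    // Hypothetical stand-in for one td element: its class attribute and its text.
    static final class Cell {
        final String classAttr;
        final String text;
        Cell(String classAttr, String text) { this.classAttr = classAttr; this.text = text; }
    }

    // Hypothetical holder for what the loop collects.
    static final class Result {
        final String unitName;
        final List<String> courses;
        Result(String unitName, List<String> courses) { this.unitName = unitName; this.courses = courses; }
    }

    static Result collect(List<Cell> td) {
        String unitName = "";
        List<String> courseNumbNameList = new ArrayList<String>();
        String courseNumbName = "";
        for (int i = 0; i < td.size(); i++) {
            Cell element = td.get(i);
            String classAttr = element.classAttr;
            if (CLASS_ATTRIB_UNIT_NAME.equals(classAttr)) {
                unitName = element.text;
            } else if (CLASS_ATTRIB_COURSE_NUM.equals(classAttr)) {
                courseNumbName = element.text;
            } else if (CLASS_ATTRIB_COURSE_NAME.equals(classAttr)) {
                // Assumed separator; finish the entry and reset for the next course.
                courseNumbName += ": " + element.text;
                courseNumbNameList.add(courseNumbName);
                courseNumbName = "";
            }
            // Cells with any other class are ignored.
        }
        return new Result(unitName, courseNumbNameList);
    }

    public static void main(String[] args) {
        List<Cell> td = Arrays.asList(
            new Cell("UnitName", "Computer Science"),
            new Cell("CourseNum", "CS 3330"),
            new Cell("CourseName", "Computer Architecture"),
            new Cell("", "some other cell"));
        Result r = collect(td);
        System.out.println("Unit: " + r.unitName);
        for (String course : r.courses) {
            System.out.println("   " + course);
        }
    }
}
```

This relies on the cells arriving in document order, with each CourseNum cell followed by its CourseName cell, which is what the algorithm above already assumes.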