Scraping ASU courses with Node.js

Posted on 2024-12-19 21:15:17

I'm pretty new to Node.js, so apologies in advance if I don't know what I'm talking about.

I'm trying to scrape some courses off ASU's course catalog (https://webapp4.asu.edu/catalog/) and have made numerous attempts using Zombie, Node.IO, and the https API. In every case I've run into a redirect loop.

I'm wondering if it's because I'm not setting my headers properly?

Below is a sample of the code I used (not Zombie/Node.IO):

var https = require('https');

var options = {
  host: 'webapp4.asu.edu',
  path: '/catalog',
  method: 'GET',
  headers: {
    'set-cookie': 'onlineCampusSelection=C'
  }
};

var req = https.request(options, function(res) {
  console.log("statusCode: ", res.statusCode);
  console.log("headers: ", res.headers);
  res.on('data', function(d) {
    process.stdout.write(d);
  });
});

// Report network errors instead of failing silently
req.on('error', function(e) {
  console.error(e);
});

// The request is only sent once end() is called
req.end();

Just to clarify, I'm not having trouble scraping with Node.js in general. It's specifically ASU's course catalog that is giving me trouble.

Appreciate any ideas you guys could give me, thanks!

Update: My request successfully went through if I create a cookie with a JSESSIONID I got from Chrome/FF. Is there a way for me to request/create a JSESSIONID?

Comments (2)

丑丑阿 2024-12-26 21:15:17

It looks like the server sets the JSESSIONID cookie and then redirects, so you need to tell Node.js not to follow redirects if you want to grab the cookie. I don't know how to do this with the http or https packages, but another package available via npm, request, lets you do it. Here's a sample that should get you started:

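var request = require("request");

var options = {
  url: "https://webapp4.asu.edu/catalog/",
  // Do not follow the redirect, so the first response's headers stay visible
  followRedirect: false
};

request.get(options, function(error, response, body) {
  if (error) {
    return console.error(error);
  }
  // The session cookie is set on the redirect response itself
  console.log(response.headers['set-cookie']);
});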

Output should look something like this:

[ 'JSESSIONID=B43CC3BB09FFCDE07AE6B3B702717431.catalog1; Path=/catalog; Secure' ]
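The same idea can probably be sketched with only the built-in https module, since it does not follow redirects on its own and cookies are sent back to the server in the Cookie request header. The sketch below is an assumption-laden illustration, not a tested scraper for this site: the helper names storeCookies, cookieHeader and getWithCookies are hypothetical, the cookie handling ignores attributes such as Path and Secure, and the redirect limit of 5 in the usage comment is arbitrary.

```javascript
// Sketch only: Node's built-in https module does not follow redirects,
// so the Set-Cookie header of every hop stays visible to us.
const https = require("https");

// Copy name=value pairs from Set-Cookie header lines into `jar` (a plain
// object). Attributes such as Path or Secure are dropped (simplification).
function storeCookies(jar, setCookieHeaders) {
  (setCookieHeaders || []).forEach(function (line) {
    const pair = line.split(";")[0];
    const eq = pair.indexOf("=");
    if (eq > 0) {
      jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
    }
  });
  return jar;
}

// Build the value of the outgoing Cookie request header from the jar.
function cookieHeader(jar) {
  return Object.keys(jar).map(function (name) {
    return name + "=" + jar[name];
  }).join("; ");
}

// GET `url`, following at most `redirectsLeft` redirects by hand and
// replaying every cookie collected so far on each hop.
function getWithCookies(url, jar, redirectsLeft, callback) {
  const headers = {};
  if (Object.keys(jar).length > 0) {
    headers["Cookie"] = cookieHeader(jar);
  }
  const req = https.get(url, { headers: headers }, function (res) {
    storeCookies(jar, res.headers["set-cookie"]);
    const redirected = res.statusCode >= 300 && res.statusCode < 400 &&
      res.headers.location;
    if (redirected) {
      res.resume(); // discard the body of the redirect response
      if (redirectsLeft <= 0) {
        return callback(new Error("too many redirects"));
      }
      // Location may be relative, so resolve it against the current URL
      const next = new URL(res.headers.location, url).toString();
      return getWithCookies(next, jar, redirectsLeft - 1, callback);
    }
    let body = "";
    res.setEncoding("utf8");
    res.on("data", function (chunk) { body += chunk; });
    res.on("end", function () { callback(null, res, body); });
  });
  req.on("error", callback);
}

// Example use (performs a real network request, so it is left commented out):
// const jar = { onlineCampusSelection: "C" };
// getWithCookies("https://webapp4.asu.edu/catalog/", jar, 5,
//   function (err, res, body) {
//     if (err) { return console.error(err); }
//     console.log("statusCode:", res.statusCode);
//     console.log("cookies sent on the last hop:", cookieHeader(jar));
//   });
```

Keeping the jar as a plain object makes it easy to seed it with the onlineCampusSelection cookie from the question and to inspect which JSESSIONID the server handed out. Whether that is enough to break the redirect loop on this particular site is not something the sketch can guarantee.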

草莓酥 2024-12-26 21:15:17

I'd highly recommend using jsDOM in conjunction with jQuery (for Node). I've used it many times for scraping, as it makes the job very easy.

Here's the example from jsdom's readme:

// Count all of the links from the nodejs build page
var jsdom = require("jsdom");

jsdom.env("http://nodejs.org/dist/", [
  'http://code.jquery.com/jquery-1.5.min.js'
],
function(errors, window) {
  console.log("there have been", window.$("a").length, "nodejs releases!");
});

Hope that helps. jsdom has made it really easy to hack together scraping experiments (for me at least).
