Zombie.js in Node.js fails to scrape certain websites
The simple script below returns a bunch of rubbish. It works for most websites, but not for William Hill:
var Browser = require("zombie");
var assert = require("assert");
// Load the William Hill football betting page
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  // Wait for pending page events to finish, then dump the HTML
  browser.wait(function () {
    console.log(browser.html());
  });
});
Running it with node produces this output:
S����J����ꪙRUݒ�kf�6���Efr2�Riz�����^��0�X�
��{�^�a�yp��p�����Ή��`��(���S]-��'N�8q�����/���?�ݻ��u;�݇�ׯ�Eiٲ>��-���3�ۗG�Ee�,��mF���MI��Q�۲������ڊ�ZG��O�J�^S�C~g��JO�緹�Oݎ���P����ET�n;v������v���D�tvJn��J�8'��햷r�v:��m��J��Z�nh�]�� ����Z����.{Z��Ӳl�B'�.¶D�~$n�/��u"�z�����Ni��"Nj��\00_I\00\��S��O�E8{"�m;�h��,o��Q�y��;��a[������c��q�D�띊?��/|?:�;��Z!}��/�wے�h�<�������%������A�K=-a��~'
(the actual output is much longer)
Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?
Thanks
Comments (2)
I abandoned this approach long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs.
https://github.com/assaf/zombie/issues/251#issuecomment-5969175
He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
Thank you to everyone who looked into this.
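For anyone wondering what that fix means in practice, here is a minimal sketch, assuming the rubbish output is a compressed response body that was printed without being decompressed. It uses only Node's built-in zlib module and does not call zombie.js at all. The names requestHeaders and decodeBody are made up for illustration, and the "server response" is simulated locally rather than fetched from the real site.

```javascript
const zlib = require("zlib");

// Hypothetical request headers: "identity" asks the server to send the body
// uncompressed, which is the idea behind the fix quoted above.
const requestHeaders = { "Accept-Encoding": "identity" };

// Hypothetical helper: decode a raw response body according to the
// Content-Encoding header value the server sent back.
function decodeBody(rawBody, contentEncoding) {
  const encoding = (contentEncoding || "identity").toLowerCase();
  if (encoding === "gzip") {
    return zlib.gunzipSync(rawBody).toString("utf8");
  }
  if (encoding === "deflate") {
    return zlib.inflateSync(rawBody).toString("utf8");
  }
  // No compression: the bytes are already the document text.
  return rawBody.toString("utf8");
}

// Simulate a server that gzips its reply.
const html = "<html><body>Football</body></html>";
const compressed = zlib.gzipSync(Buffer.from(html, "utf8"));

// Treating the compressed bytes as text gives unreadable characters,
// similar in spirit to the output in the question.
console.log(compressed.toString("utf8"));

// Decompressing first gives back the original markup.
console.log(decodeBody(compressed, "gzip"));
```

If a client cannot decompress gzip, sending the identity header is the simpler route. If it can, checking Content-Encoding before reading the body avoids printing compressed bytes as if they were HTML.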
The same code works for other sites that also reply with gzip, so this does not look like a problem with your code.
My guess is that the site detects that you are not running a real browser and is defending against data extraction.
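One way to tell a compression problem apart from deliberate blocking is to look at the first two bytes of the raw response body: a gzip stream always starts with 0x1f 0x8b. The sketch below is only an illustration using Node's built-in zlib. The function name looksLikeGzip is made up, and the check has to run on the raw bytes, not on a string that has already been decoded as text.

```javascript
const zlib = require("zlib");

// Hypothetical helper: true if the buffer starts with the gzip magic bytes.
function looksLikeGzip(rawBody) {
  return rawBody.length >= 2 && rawBody[0] === 0x1f && rawBody[1] === 0x8b;
}

// Locally generated samples standing in for real responses.
const gzipped = zlib.gzipSync(Buffer.from("<html></html>", "utf8"));
const plain = Buffer.from("<html></html>", "utf8");

console.log(looksLikeGzip(gzipped)); // true
console.log(looksLikeGzip(plain));   // false
```

If the raw body passes this check, the rubbish is most likely an undecoded gzip reply. If it does not, something else is going on, and the blocking theory becomes more plausible.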