Web scraping in txt mode
I am currently using watir to scrape a website that hides all of its data from the usual HTML source. If I am not mistaken, it uses XML and AJAX to hide it. Firefox can see the data, but only when it is displayed via "DOM Source of selection".
Everything works fine, but now I am looking for a tool equivalent to watir that works without a browser. Everything needs to be done in a txt file.
Right now, watir uses my browser to render the page and returns the whole HTML code I am looking for. I would like to do the same, but without the browser.
Is it possible?
Thanks
Regards
Tak
Comments (2)
Your best bet would be to use something like webscarab to capture the URLs of the AJAX requests your browser is making.
That way, you can grab the "important" data yourself by simulating those calls with any HTTP library.
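As a rough illustration of simulating such a call, here is a minimal sketch using only Python's standard library. The helper names (`build_replay_request`, `decode_body`, `replay`) are hypothetical, and the `X-Requested-With` header is an assumption: some servers check for it on AJAX endpoints, others ignore it. The URL to replay is whatever the proxy captured.

```python
import json
import urllib.request


def build_replay_request(captured_url):
    """Build a GET request that resembles a browser's AJAX call (hypothetical helper)."""
    return urllib.request.Request(
        captured_url,
        headers={
            # Assumption: the server may expect this header on AJAX endpoints.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json, text/xml, */*",
        },
    )


def decode_body(raw_bytes, encoding="utf-8"):
    """Return parsed JSON when the body is valid JSON, otherwise the decoded text."""
    text = raw_bytes.decode(encoding)
    try:
        return json.loads(text)
    except ValueError:
        # Not JSON (for example XML or an HTML fragment): hand back the raw text.
        return text


def replay(captured_url, timeout=10):
    """Send the request and decode whatever the server returns."""
    request = build_replay_request(captured_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_body(response.read())
```

Calling `replay(url)` with a captured URL returns either a parsed JSON object or plain text, which can then be written to a txt file without any browser involved.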
It is possible with a little Python coding.
I wrote a simple script to fetch locations of cargo offices.
First steps
http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
With the page open and the browser's network inspector running, you will see GetTownByCity in the left pane. Click it and inspect it.
Request URL: (...)/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetTownByCity
Request Method: POST
Status Code: 200 OK
In the "Request Payload" tree item you will see the header "Request Payload: {cityId:34}".
This will guide us in implementing the Python code.
Let's do it.
Note that this code is part of my working code and was written on the fly; most importantly, I did not test it. It may need small modifications to run.
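A minimal sketch of the request described above, not the original script. It uses only the standard library. The function names are hypothetical, the host is passed in as `base_url` because the request URL above is elided with `(...)`, and the JSON content type is an assumption based on the shape of the payload. The format of the response is not known from the request details, so the sketch only parses it as JSON and returns it.

```python
import json
import urllib.request

# Path taken from the "Request URL" shown in the inspector; the host part is elided there.
ENDPOINT_PATH = (
    "/_layouts/ArikanliHolding.YurticiKargo.WebSite/"
    "ajaxproxy-sswservices.aspx/GetTownByCity"
)


def build_town_request(base_url, city_id):
    """Build the POST request carrying the {cityId: ...} payload (hypothetical helper)."""
    body = json.dumps({"cityId": city_id}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + ENDPOINT_PATH,
        data=body,
        # Assumption: the service expects a JSON body.
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def fetch_towns(base_url, city_id, timeout=10):
    """Send the request and return the parsed JSON response."""
    request = build_town_request(base_url, city_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_towns(base_url, 34)` would mirror the captured request for `cityId` 34; other city IDs can be tried the same way once the response format has been inspected.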