Web scraping in txt mode

Posted on 2024-09-08 12:04:41


I am currently using watir to do web scraping of a website that hides all its data from the usual HTML source. If I am not wrong, they are using XML and AJAX techniques to hide it. Firefox can see the data, but only via "DOM Source of selection".

Everything works fine, but now I am looking for a tool equivalent to watir where everything is done without a browser. Everything needs to be done in a txt file.

In fact, right now watir is using my browser to render the page and return the whole HTML code I am looking for. I would like to do the same, but without the browser.

Is it possible?

Thanks
Regards
Tak


Comments (2)

柠檬心 2024-09-15 12:04:41


Your best bet would be to use something like webscarab and capture the URLs of the AJAX requests your browser is making.
That way, you can grab the "important" data yourself by simulating those calls with any HTTP library.
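
For illustration, here is a minimal sketch of that approach in Python with the requests library. The endpoint URL and the parameters are hypothetical placeholders for whatever a proxy such as webscarab would show you, not the real ones from the site in question.

# Sketch: replay an AJAX call captured with a proxy such as webscarab.
# The URL and parameters below are hypothetical placeholders.
import requests

captured_url = 'http://example.com/ajax/getData'  # hypothetical endpoint seen in the proxy
params = {'id': 42}                               # hypothetical query parameters

r = requests.get(captured_url, params=params)
r.raise_for_status()

# The raw body (HTML, XML, or JSON) can be written straight to a txt file,
# no browser involved.
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(r.text)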

或十年 2024-09-15 12:04:41


It is possible with a little Python coding.

I wrote a simple script to fetch locations of cargo offices.

First steps

  1. Open the ajax page with Google Chrome for example, in Turkish but you can understand it.
    http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
  2. Press F12 to show bottom developer tools and navigate to Network tab.
  3. Navigate XHR tab on the bottom.
  4. Make an AJAX request by selecting an item in the first combobox. And go to Headers Tab
  5. You will GetTownByCity on left pane, click it and inspect it.

    Request URL: (...)/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-
    sswservices.aspx/GetTownByCity

    Request Method:POST

    Status Code:200 OK

  6. In the Request Payload tree item you will see

    Request Payload :{cityId:34}
    header.

  7. This will guide us to implement a python code.

Let's do it.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json
# import simplejson as json

baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/'
getTown = 'ajaxproxy-sswservices.aspx/GetTownByCity'
urlGetTown = baseUrl + ajaxRoot + getTown

headers = {'content-type': 'application/json', 'encoding': 'utf-8'}  # HTTP headers telling the server we are sending JSON

for plaka in range(1, 82):  # Turkey's province plate codes run from 1 to 81
    payload = {'cityId': plaka}
    r = requests.post(urlGetTown, data=json.dumps(payload), headers=headers)
    data = r.json()  # The response is JSON; if you need the raw body, use r.content or r.text
    # ... Process the fetched data with a JSON parser,
    # or, if it were HTML, with Beautiful Soup, lxml, etc.

Note that this code is adapted from my working code and was written on the fly; most importantly, I did not test it. It may require small modifications to run.
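
To tie this back to the "txt file" requirement in the question, here is a hedged sketch of how the fetched data could be appended to a plain text file, assuming (as the snippet above does) that the service returns JSON; the output file name towns.txt is just an example.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import requests

baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/'
getTown = 'ajaxproxy-sswservices.aspx/GetTownByCity'
urlGetTown = baseUrl + ajaxRoot + getTown
headers = {'content-type': 'application/json'}

# Append each city's JSON response as one line of a plain text file -- no browser needed.
# 'towns.txt' is just an example file name.
with open('towns.txt', 'a', encoding='utf-8') as out:
    for plaka in range(1, 82):  # province plate codes 1..81
        r = requests.post(urlGetTown, data=json.dumps({'cityId': plaka}), headers=headers)
        r.raise_for_status()
        out.write(json.dumps(r.json(), ensure_ascii=False) + '\n')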
