Can't get JSON content from a stubborn webpage using Scrapy

Posted on 2025-02-09 03:34:00

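I'm trying to create a script using Scrapy to grab JSON content from this webpage. I've set headers in the script accordingly, but when I run it I always end up with a JSONDecodeError. The site sometimes throws a CAPTCHA, but not always. However, the script below has never succeeded, even when I used a VPN. How can I fix it?

This is how I've tried: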

import scrapy
import urllib.parse  # urlencode lives in the urllib.parse submodule

class ImmobilienScoutSpider(scrapy.Spider):
    name = "immobilienscout"
    start_url = "https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen"
    
    headers = {
        'accept': 'application/json; charset=utf-8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'x-requested-with': 'XMLHttpRequest',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    }

    params = {
        'price': '1000.0-',
        'constructionyear': '-2000',
        'pagenumber': '1'
    }

    def start_requests(self):
        req_url = f'{self.start_url}?{urllib.parse.urlencode(self.params)}'
        yield scrapy.Request(
            url=req_url,
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self,response):
        yield {"response":response.json()}

This is how the output should look (truncated):

{"searchResponseModel":{"additional":{"lastSearchApiUrl":"/region?realestatetype=apartmentbuy&price=1000.0-&constructionyear=-2000&pagesize=20&geocodes=1276010&pagenumber=1","title":"Eigentumswohnung in Nordrhein-Westfalen - ImmoScout24","sortingOptions":[{"description":"Standardsortierung","code":0},{"description":"Kaufpreis (höchste zuerst)","code":3},{"description":"Kaufpreis (niedrigste zuerst)","code":4},{"description":"Zimmeranzahl (höchste zuerst)","code":5},{"description":"Zimmeranzahl (niedrigste zuerst)","code":6},{"description":"Wohnfläche (größte zuerst)","code":7},{"description":"Wohnfläche (kleinste zuerst)","code":8},{"description":"Neubau-Projekte (Projekte zuerst)","code":31},{"description":"Aktualität (neueste zuerst)","code":2}],"pagerTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=%page%","sortingTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&sorting=%sorting%","world":"LIVING","international":false,"device":{"deviceType":"NORMAL","devicePlatform":"UNKNOWN","tablet":false,"mobile":false,"normal":true}

EDIT:

This is how the script built upon the requests module looks:

import requests

link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen'

headers = {
    'accept': 'application/json; charset=utf-8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'x-requested-with': 'XMLHttpRequest',
    'content-type': 'application/json; charset=utf-8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=1',
    # 'cookie': 'hardcoded cookies'
}

params = {
    'price': '1000.0-',
    'constructionyear': '-2000',
    'pagenumber': '2'
}

sess = requests.Session()
sess.headers.update(headers)
resp = sess.get(link,params=params)
print(resp.json())



心不设防 2025-02-16 03:34:00

Scrapy's CookiesMiddleware disregards a 'cookie' passed in headers.
Reference: scrapy/scrapy#1992

Pass cookies explicitly:

# Requires `import http.cookies` at the top of the module.
yield scrapy.Request(
    url=req_url,
    headers=self.headers,
    callback=self.parse,
    # Add the following line:
    cookies={k: v.value for k, v in http.cookies.SimpleCookie(self.headers.get('cookie', '')).items()},
)
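
To see what that dict comprehension produces, here is a minimal standalone sketch that uses only the standard library. The helper name `cookie_header_to_dict` and the cookie values are made up for illustration; they are not part of Scrapy or taken from the site.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(cookie_header: str) -> dict:
    """Turn a raw 'Cookie' header string into a plain name -> value dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)  # parses "name=value; name2=value2"
    return {name: morsel.value for name, morsel in jar.items()}


if __name__ == "__main__":
    raw = "reese84=abc123; sessionid=xyz789"  # placeholder values, not real cookies
    print(cookie_header_to_dict(raw))
```

The resulting dict is the shape that the `cookies=` argument of `scrapy.Request` accepts. `SimpleCookie` may skip entries it cannot parse, so it is worth checking that the expected cookie name is actually present in the result.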

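Note: That site uses GeeTest CAPTCHA, which cannot be solved by simply rendering the page or using Selenium, so you still need to periodically update the hardcoded cookie (cookie name: reese84) taken from the browser, or use a service like 2Captcha.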
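
Because the hardcoded cookie eventually expires, the site may start returning the CAPTCHA page instead of JSON again, which is what raises JSONDecodeError. One way to notice this without crashing is to decode defensively. The sketch below is an assumption about how you might do that, and `parse_json_or_none` is a hypothetical helper, not something Scrapy provides.

```python
import json
from typing import Optional


def parse_json_or_none(body: str) -> Optional[dict]:
    """Return the decoded JSON object, or None if the body is not a JSON object."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Typically an HTML page (e.g. a CAPTCHA challenge) rather than JSON.
        return None
    return data if isinstance(data, dict) else None


if __name__ == "__main__":
    print(parse_json_or_none('{"searchResponseModel": {}}'))
    print(parse_json_or_none("<html><body>captcha</body></html>"))
```

In the spider's `parse` callback you could call this on `response.text` and, when it returns None, log the response status and treat it as a signal that the cookie needs refreshing.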

