Python web scraping: how to configure a multipart/form-data payload
I'm trying to scrape this site:
https://kfz.virtuelles-rathaus.de/igv2-man/servlet/Internetgeschaeftsvorfaelle
But I can't figure out how to set up a valid request body.
The status code is 200, so the request itself goes through.
The response, however, says the data couldn't be processed because I used the browser navigation; it wants me to use the website's buttons instead.
I think this is because the payload isn't configured correctly, but I can't get it to work.
I have already tried the following:
- Setting the headers manually
- Using requests_toolbelt.multipart.encoder
- Creating the boundary for the multipart/form-data format myself
The decoded payload copied from Chrome DevTools:
WKZ_UNTERSCH_Z: WT
WKZ_ERKENN_Z: SJ
WKZ_ZIFFERN: 454
WKZ_SUCHMERKMAL: NULL
BTN_WKZSUCHE: suchen
ZEITSTEMPEL: 2022040815031191
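To make the ZEITSTEMPEL value easier to read, here is a minimal sketch that decodes it. Reading the 16 digits as date, time, and hundredths of a second is an assumption based only on the value shown above; the names `STAMP_FORMAT`, `parse_stamp`, and `format_stamp` are made up for this illustration. It only shows the format and says nothing about which values the server accepts.

```python
from datetime import datetime

# Assumed layout: YYYYMMDDHHMMSS followed by two fractional-second digits.
STAMP_FORMAT = "%Y%m%d%H%M%S%f"


def parse_stamp(stamp):
    """Parse a ZEITSTEMPEL-style string into a datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def format_stamp(moment):
    """Format a datetime back into the 16-digit form."""
    # %f always produces six digits; dropping the last four keeps hundredths.
    return moment.strftime(STAMP_FORMAT)[:-4]


observed = "2022040815031191"
moment = parse_stamp(observed)
print(moment.isoformat())
print(format_stamp(moment) == observed)
```

Note the order of the directives: hours, then minutes (`%M`), then seconds (`%S`).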
The encoded payload copied from Chrome DevTools:
------WebKitFormBoundaryj9dFOsSgrDr5dSwA
Content-Disposition: form-data; name="WKZ_UNTERSCH_Z"

WT
------WebKitFormBoundaryj9dFOsSgrDr5dSwA
Content-Disposition: form-data; name="WKZ_ERKENN_Z"

SJ
------WebKitFormBoundaryj9dFOsSgrDr5dSwA
Content-Disposition: form-data; name="WKZ_ZIFFERN"

454
------WebKitFormBoundaryj9dFOsSgrDr5dSwA
Content-Disposition: form-data; name="WKZ_SUCHMERKMAL"

NULL
------WebKitFormBoundaryj9dFOsSgrDr5dSwA
Content-Disposition: form-data; name="BTN_WKZSUCHE"

suchen
------WebKitFormBoundaryj9dFOsSgrDr5dSwA
Content-Disposition: form-data; name="ZEITSTEMPEL"

2022040815031191
------WebKitFormBoundaryj9dFOsSgrDr5dSwA--
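For reference, a body with this layout can be built with the standard library alone. The following is a rough sketch for plain text fields only, not how the browser or any particular library does it; `encode_multipart` and the boundary string `example-boundary` are hypothetical names chosen for the illustration.

```python
import uuid


def encode_multipart(fields, boundary=None):
    """Encode text fields as a multipart/form-data body.

    Returns a (body_bytes, content_type_header_value) tuple.
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates the part headers from the value
        lines.append(str(value))
    lines.append(f"--{boundary}--")  # closing delimiter
    lines.append("")
    # Lines in a multipart body are separated by CRLF.
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


fields = {
    "WKZ_UNTERSCH_Z": "WT",
    "WKZ_ERKENN_Z": "SJ",
    "WKZ_ZIFFERN": "454",
    "WKZ_SUCHMERKMAL": "NULL",
    "BTN_WKZSUCHE": "suchen",
    "ZEITSTEMPEL": "2022040815031191",
}
body, content_type = encode_multipart(fields, boundary="example-boundary")
print(content_type)
print(body.decode("utf-8"))
```

The returned `content_type` value would have to be sent as the `Content-Type` header together with the body, since the server needs the boundary to split the parts.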
My code:
import requests
from datetime import datetime


def main():
    payload = {
        "WKZ_UNTERSCH_Z": "WT",
        "WKZ_ERKENN_Z": "GH",
        "WKZ_ZIFFERN": "454",
        "WKZ_SUCHMERKMAL": "NULL",
        "BTN_WKZSUCHE": "suchen",
        # Current time as YYYYMMDDHHMMSS plus hundredths of a second
        "ZEITSTEMPEL": datetime.now().strftime('%Y%m%d%H%M%S%f')[:-4],
    }
    url = 'https://kfz.virtuelles-rathaus.de/igv2-man/servlet/Internetgeschaeftsvorfaelle'

    # Initialize the session and get the cookie with the session ID
    s = requests.Session()
    r = s.get(f'{url}?MANDANT=08337000&AUFRUF=WKZ')
    # Note: data= sends the fields form-urlencoded, not as multipart/form-data
    r = s.post(url, data=payload, verify=False)

    # Save the response for further scraping
    with open('z_1.html', 'w') as f:
        f.write(r.text)


if __name__ == '__main__':
    main()
Thanks in advance for your help.
EDIT:
The body that's created when using the MultipartEncoder:
--b3dccffd58a47883c42249db16600856
Content-Disposition: form-data; name="WKZ_UNTERSCH_Z"

WT
--b3dccffd58a47883c42249db16600856
Content-Disposition: form-data; name="WKZ_ERKENN_Z"

SJ
--b3dccffd58a47883c42249db16600856
Content-Disposition: form-data; name="WKZ_ZIFFERN"

454
--b3dccffd58a47883c42249db16600856
Content-Disposition: form-data; name="WKZ_SUCHMERKMAL"

NULL
--b3dccffd58a47883c42249db16600856
Content-Disposition: form-data; name="BTN_WKZSUCHE"

suchen
--b3dccffd58a47883c42249db16600856
Content-Disposition: form-data; name="ZEITSTEMPEL"

2022040909485493
--b3dccffd58a47883c42249db16600856--
Comments (1)
I found the answer to my problem.
It wasn't a problem with the data format; I used the wrong timestamp.
The solution is to use the timestamp from the previous response rather than the current time.
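A rough sketch of that idea follows. It assumes the previous response carries the timestamp as a hidden form input named ZEITSTEMPEL; that is a guess about the page, not something confirmed here, so check the actual HTML first. `HiddenInputParser`, `extract_hidden_fields`, and `sample_html` are hypothetical names and data made up for the illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden input fields found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Hypothetical stand-in for the HTML returned by the initial GET request.
sample_html = """
<form method="post" enctype="multipart/form-data">
  <input type="text" name="WKZ_ZIFFERN">
  <input type="hidden" name="ZEITSTEMPEL" value="2022040815031191">
</form>
"""

hidden = extract_hidden_fields(sample_html)
payload = {
    "WKZ_UNTERSCH_Z": "WT",
    "WKZ_ERKENN_Z": "SJ",
    "WKZ_ZIFFERN": "454",
    "WKZ_SUCHMERKMAL": "NULL",
    "BTN_WKZSUCHE": "suchen",
    **hidden,  # ZEITSTEMPEL comes from the page, not from the local clock
}
print(payload["ZEITSTEMPEL"])
```

With requests, `sample_html` would be replaced by `r.text` from the `s.get(...)` call, and the merged `payload` would then be posted through the same session so the cookie and the timestamp belong together.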