Links don't have a URL format, so how can I scrape them with Scrapy?
This is my code:
import scrapy
from scrapy import Spider
from scrapy.http import FormRequest


class ProvinciaSpider(Spider):
    name = 'provincia'
    allowed_domains = ['aduanet.gob.pe']
    start_urls = ['http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias?accion=cargaConsultaManifiesto&tipoConsulta=salidaProvincia']

    def parse(self, response):
        # Submit the search form for one manifest.
        data = {'accion': 'consultaManifExpProvincia',
                'salidaPro': 'YES',
                'strMenu': '-',
                'strEmpTransTerrestre': '-',
                'CMc1_Anno': '2022',
                'CMc1_Numero': '96',
                'CG_cadu': '046',
                'viat': '1'}
        yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias', formdata=data, callback=self.parse_form_page)

    def parse_form_page(self, response):
        table = response.xpath('/html/body/form[1]//td[@class="beta"]/table')
        trs = table.xpath('.//tr')[1:]  # skip the header row
        for tr in trs:
            puerto_llegada = tr.xpath('.//td[1]/text()').extract_first().strip()
            pais = tr.xpath('.//td[1]/text()').extract_first().strip()
            bl = tr.xpath('.//td[3]/text()').extract_first().strip()
            peso = tr.xpath('.//td[8]/text()').extract_first().strip()
            bultos = tr.xpath('.//td[9]/text()').extract_first().strip()
            consignatario = tr.xpath('.//td[12]/text()').extract_first().strip()
            embarcador = tr.xpath('.//td[13]/text()').extract_first().strip()
            links = tr.xpath('.//td[4]/a/@href')
            # This is the call that fails: the href is javascript:jsDetalle2('154');
            yield response.follow(links.get(),
                                  callback=self.parse_categories,
                                  meta={'puerto_llegada': puerto_llegada,
                                        'pais': pais,
                                        'bl': bl,
                                        'peso': float("".join(peso.split(','))),
                                        'bultos': float("".join(bultos.split(','))),
                                        'consignatario': consignatario,
                                        'embarcador': embarcador})

    def parse_categories(self, response):
        puerto_llegada = response.meta['puerto_llegada']
        pais = response.meta['pais']
        bl = response.meta['bl']
        peso = response.meta['peso']
        bultos = response.meta['bultos']
        consignatario = response.meta['consignatario']
        embarcador = response.meta['embarcador']
        tabla_des = response.xpath('/html/body/form//td[@class="beta"]/table')
        trs3 = tabla_des.xpath('.//tr')[1:]  # skip the header row
        for tr3 in trs3:
            # Read from the current row (tr3), not the undefined name tr.
            descripcion = tr3.xpath('.//td[7]/text()').extract_first().strip()
            yield {'puerto_llegada': puerto_llegada,
                   'pais': pais,
                   'bl': bl,
                   'peso': peso,
                   'bultos': bultos,
                   'consignatario': consignatario,
                   'embarcador': embarcador,
                   'descripcion': descripcion}
And I get this error:
ValueError: Missing scheme in request url: javascript:jsDetalle2('154');
Every link that I want to extract data from has that format, so my code for extracting the data behind each link doesn't work.
The link format is javascript:jsDetalle2('154'), and only the number changes.
The problem is that it is neither http://........ nor /manifiesto...... In the first case you only have to follow the link, and in the second case you have to join the relative part with the first response URL. This case is neither of those, so I don't know how to make it work.
How can I write it so that it works?
Comments (1)
I checked this link in the browser. When I click the link with the text 154, it sends a POST with many values, and one of them is 'CMc2_NumDet': '154', so you can take this number from the link and use it in the POST.
In the browser you can see 'CMc2_Numero': "+++96", but in code you need spaces instead of +, like "   96" (Scrapy will then send + in place of each space), or you can remove all the + characters and send "96".
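The difference between typing a space and typing a literal + can be checked without running the spider. The snippet below is only a sketch: it uses the standard library's urlencode as a stand-in for the form encoding that a form request applies, on the assumption that Scrapy's FormRequest encodes formdata the same way.

```python
from urllib.parse import urlencode

# Three spaces before the number: each space is encoded as "+",
# which is what the browser shows as "+++96".
padded = urlencode({'CMc2_Numero': '   96'})

# Typing the plus signs literally sends something different:
# each "+" is percent-encoded as %2B.
literal = urlencode({'CMc2_Numero': '+++96'})

# Dropping the padding entirely.
stripped = urlencode({'CMc2_Numero': '96'})

print(padded)    # CMc2_Numero=+++96
print(literal)   # CMc2_Numero=%2B%2B%2B96
print(stripped)  # CMc2_Numero=96
```

So a value copied straight from the browser as "+++96" would not reach the server in the same form; it has to be written with spaces (or without padding) in the formdata dict.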
BTW: I put all the values in meta as item: {...}, so later I can get all of them in one line with meta['item'].
Full working code.
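As a rough sketch of the core step rather than a complete spider, the number can be pulled out of the javascript: href with a regular expression and placed in the POST data. The helper names extract_detail_number and build_detail_formdata, and the extra_fields parameter, are hypothetical. Only CMc2_NumDet and CMc2_Numero are named above; the remaining POST values and the target URL are not listed here and would have to be copied from the browser's network tab.

```python
import re

# Matches hrefs such as "javascript:jsDetalle2('154');" and captures the number.
DETAIL_HREF = re.compile(r"jsDetalle2\('\s*(\d+)\s*'\)")


def extract_detail_number(href):
    """Return the number inside javascript:jsDetalle2('...'), or None."""
    match = DETAIL_HREF.search(href or '')
    return match.group(1) if match else None


def build_detail_formdata(href, manifest_number, extra_fields=None):
    """Build the POST data for one detail link (hypothetical helper).

    extra_fields stands for the other POST values seen in the browser;
    they are not known here, so the caller must supply them.
    """
    number = extract_detail_number(href)
    if number is None:
        return None
    formdata = dict(extra_fields or {})
    formdata['CMc2_NumDet'] = number
    formdata['CMc2_Numero'] = manifest_number
    return formdata


formdata = build_detail_formdata("javascript:jsDetalle2('154');", '96')
print(formdata)  # {'CMc2_NumDet': '154', 'CMc2_Numero': '96'}

# Inside parse_form_page this would replace response.follow(...), roughly:
#     item = {'puerto_llegada': puerto_llegada, 'pais': pais, ...}
#     yield FormRequest(url, formdata=formdata,
#                       callback=self.parse_categories, meta={'item': item})
# where url is the address the browser posts to (an assumption to verify),
# and parse_categories starts with: item = response.meta['item']
```

Returning None for an href that does not match lets the spider skip rows without a detail link instead of raising.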
The page with categories may have many rows in the table (with different Peso Bruto values, which you don't use), so it may produce many rows in the CSV. If you need only one row, loop over trs3[:1]: instead of trs3:
I used a different xpath to find the table with "Descripcion", because the previous version didn't check whether the table has Descripcion, and it could get 3 tables instead of one.
Result (with trs[:1])
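To illustrate the two points about picking the right table and limiting the rows, here is a small standard-library sketch on made-up data. It is not the output from the site and not the xpath used in the spider: the SAMPLE markup, its values, and the helper tables_with_header are invented for the example, and ElementTree only works here because the sample is well-formed.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup with two tables; only one has "Descripcion".
SAMPLE = """<html><body><form>
<table><tr><td>Manifiesto</td><td>96</td></tr></table>
<table>
<tr><td>Peso Bruto</td><td>Descripcion</td></tr>
<tr><td>10.5</td><td>PALTAS</td></tr>
<tr><td>20.0</td><td>MANGOS</td></tr>
</table>
</form></body></html>"""


def tables_with_header(root, header_text):
    """Keep only the tables in which some cell contains header_text."""
    return [table for table in root.iter('table')
            if any(header_text in (cell.text or '') for cell in table.iter('td'))]


root = ET.fromstring(SAMPLE)
tables = tables_with_header(root, 'Descripcion')

rows = tables[0].findall('tr')[1:]  # skip the header row
all_descriptions = [row.findall('td')[1].text.strip() for row in rows]
first_only = [row.findall('td')[1].text.strip() for row in rows[:1]]

print(len(tables))       # 1
print(all_descriptions)  # ['PALTAS', 'MANGOS']
print(first_only)        # ['PALTAS']

# With a Scrapy selector, a comparable filter could be expressed with
# contains(), for example: //table[.//td[contains(., "Descripcion")]]
```

Slicing with [:1] keeps the loop structure unchanged and still works when the table has no data rows, since the slice is then simply empty.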