Why does parsing only happen on the first item of each table?

Posted on 2025-02-10 15:30:49

I'm new to Python and web scraping and would appreciate some advice. I have created the spider below, but the JSON output only contains the first element of each table. Can anyone tell me why that happens?

import scrapy

class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']
    
    def parse(self, response):
        for actaelements in response.css('table.acta-table'):
            try:
                yield {
                    'name': actaelements.css('a::text').get(),
                    'link': actaelements.css('a').attrib['href'],
                }
            except:
                yield {
                    'name': actaelements.css('a::text').get(),
                    'link': 'Link Error',
                }
        

My ultimate goal is to create a JSON file that creates for each table the necessary information:

{
  "DadesPartit":
    {
      "Temporada": "2021-2022",
      "Categoria": "Cadet",
      "Divisio": "Primera",
      "Grup": 2,
      "Jornada": 28
    },
  "TitularsCasa":
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "JAIME",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "BRUNO",
        "Cognom":"FERRÉ CORREA",
        "Link": "https://.."
      }
      
    ],
  "SuplentsCasa":
    [
      {
        "Nom": " MARC",
        "Cognom":"GIMÉNEZ ABELLA",
        "Link": "https://.."
      }
    ],
  "CosTecnicCasa":
    [
      {
        "Nom": " JORDI",
        "Cognom":"LORENTE VILLENA",
        "Llicencia": "E"
      }
    ],
  "TargetesCasa": 
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Tipus": "Groga",
        "Minut": 65
      }
    ],
  "Arbitres":
    [
      {
        "Nom": " ALEJANDRO",
        "Cognom":"ALVAREZ MOLINA",
        "Delegacio": "Barcelona1"
        
      }
    ],
  "Gols":
    [
      {
        "Nom": "NATXO",
        "Cognom":"MONTERO RAYA",
        "Minut": 5,
        "Tipus": "Gol de penal"
      }
    ],
  "Estadi":
    {
      "Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA,
      "Direccio":"C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
    },
    "TitularsFora":
    [
      {
        "Nom": "MARTI",
        "Cognom":"MOLINA MARTIMPE",
        "Link": "https://.."
      },
      {
        "Nom": " XAVIER",
        "Cognom":"MORA AMOR",
        "Link": "https://.."
      },
      {
        "Nom": " IVAN",
        "Cognom":"ARRANZ MORALES",
        "Link": "https://.."
      }
      
    ],
  "SuplentsFora":
    [
      {
        "Nom": "OLIVER",
        "Cognom":"ALCAZAR SANCHEZ",
        "Link": "https://.."
      }
    ],
  "CosTecnicFora":
    [
      {
        "Nom": " RAFAEL",
        "Cognom":"ESPIGARES MARTINEZ",
        "Llicencia": "D"
      }
    ],
  "TargetesFora": 
    [
      {
        "Nom": " ORIOL",
        "Cognom":"ALCOBA LAGE",
        "Tipus": "Groga",
        "Minut": 34
      }
    ]
}
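
For context, here is a rough sketch of a parse callback that at least groups every player link into one item per table. It is only a sketch: the placeholder keys 'table_0', 'table_1', ... stand in for the real section names above (TitularsCasa, SuplentsFora, ...), which it does not attempt to read from the page.

import scrapy


class ActaTablesSketchSpider(scrapy.Spider):
    # Rough sketch only: yields one item per match page, with all links grouped per table.
    name = 'acta_tables_sketch'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):
        item = {}
        for index, table in enumerate(response.css('table.acta-table')):
            # 'table_0', 'table_1', ... are placeholders for the real section names.
            item[f'table_{index}'] = [
                {
                    'name': row.css('a::text').get(),
                    'link': row.css('a::attr(href)').get(default='Link Error'),
                }
                for row in table.css('tbody tr')
            ]
        yield item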

Thanks,
Joan

Comments (2)

无需解释 2025-02-17 15:30:49

CSS selectors return a list of matching elements. Your query matches the table elements themselves, so the for loop runs once per table and retrieves only the first link in each one. One minor adjustment you could make is to use XPath to select all of the children of the table, and your code should work as expected.

Simply change your for loop to:

for actaelements in response.xpath('//table[@class="acta-table"]/*'):

And the rest of your code should work the way you would expect.
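
If you want to double-check what each selector actually matches before changing the spider, the Scrapy shell is handy. Something along these lines (the exact counts depend on the live page):

scrapy shell 'https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b'
>>> len(response.css('table.acta-table'))                   # number of acta-table elements
>>> len(response.xpath('//table[@class="acta-table"]/*'))   # child elements across all tables
>>> response.css('table.acta-table')[0].css('a::text').getall()  # every name inside the first table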

夏末的微笑 2025-02-17 15:30:49

It happens because your CSS selector is wrong: it selects only the table itself, not the items inside it. Also, you can remove the try/except and give the link a default value for when it is None.

import scrapy


class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):
        # One item per table row; get(default=...) replaces the try/except from the question.
        for actaelements in response.css('table.acta-table tbody tr'):
            yield {
                'name': actaelements.css('a::text').get(),
                'link': actaelements.css('a::attr(href)').get(default='Link Error'),
            }
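
In case it helps, assuming the spider is saved in a file called acta_spider.py, you can run it and export the items to JSON with Scrapy's feed exports, for example:

scrapy runspider acta_spider.py -O acta.json

The -O flag (recent Scrapy versions) overwrites the output file on each run; use -o to append instead. Inside a Scrapy project, scrapy crawl acta_spider -O acta.json does the same.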