PostgreSQL / psycopg2 problem while scraping

I ran into an issue with my scraper; I am at a loss and need your help here.

I scrape data from www.racingpost.com and store it in a PostgreSQL database (using pgAdmin 4 to manage it), with psycopg2 for the connection.

This worked perfectly fine until yesterday, when I wanted to scrape some newer data.

I collect the links to the pages with one spider, save them in a JSON file, read the JSON file with a second spider, and crawl those links. I log errors so I can see where a problem might be.
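Roughly, the second spider reads its start URLs from that JSON file like this (a simplified sketch; the file name, JSON structure and selector here are placeholders, not my exact code):

    import json

    import scrapy


    class RaceResultSpider(scrapy.Spider):
        """Second spider: crawls the race pages collected by the first spider."""
        name = "raceresults"

        def start_requests(self):
            # links.json is written by the first spider (name and structure are placeholders)
            with open("links.json", encoding="utf-8") as f:
                links = json.load(f)
            for entry in links:
                yield scrapy.Request(entry["url"], callback=self.parse)

        def parse(self, response):
            # extract the race fields and yield the item that the pipeline stores
            yield {"racename": response.css("h2::text").get()}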

Now, when I run the spiders, the second spider no longer crawls all the links in the JSON file properly.
The code is exactly the same as before, so the code itself is not the issue; it seems the behavior of psycopg2 has changed.
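For reference, the pipeline follows the usual Scrapy/psycopg2 pattern (a simplified sketch: the connection details are placeholders and the column list is shortened here, while the real insert in pipelines.py has 25 columns):

    import psycopg2


    class RacesPipeline:
        def open_spider(self, spider):
            # connection details are placeholders
            self.conn = psycopg2.connect(
                host="localhost", dbname="racing", user="postgres", password="***"
            )
            self.cur = self.conn.cursor()

        def process_item(self, item, spider):
            self.store_db(item)
            return item

        def store_db(self, item):
            # column list shortened for readability; the real statement has 25 columns
            self.cur.execute(
                "insert into races(racedate, track, racename) values (%s, %s, %s)",
                (item["date"][0], item["track"][0], item["racename"][0]),
            )
            self.conn.commit()

        def close_spider(self, spider):
            self.cur.close()
            self.conn.close()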

After the first error:

    ERROR:scrapy.core.scraper:Error processing {'date': ('16.03.2022',), 'track': ('Chantilly (FR)',), 'racename': ('Prix des Ecuries Cantiliennes (Handicap) (4yo+) (All-Weather Track) (Polytrack)',), 'racetype': ('Flat',), 'distance': (1911.1,), 'group': ('',), 'raceclass': (0,), 'classrating': (0,), 'alterteilnehmer': ('4yo+',), 'starterzahl': (13,), 'minalter': (4,), 'maxalter': (99,), 'winningtime': (0,), 'going': ('Standard',), 'finalhurdle': (0,), 'omitted': (0,), 'pricemoney1': (9319.0,), 'pricemoney2': (3727.0,), 'pricemoney3': (2796.0,), 'pricemoney4': (1863.0,), 'pricemoney5': (932.0,), 'pricemoney6': (0.0,), 'pricemoney7': (0.0,), 'pricemoney8': (0.0,), 'racetime': ('3:45',)}
    Traceback (most recent call last):
      File "C:\ProgramData\Anaconda3\envs\virtual_workspace\envs\py39\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
        current.result = callback(  # type: ignore[misc]
      File "C:\ProgramData\Anaconda3\envs\virtual_workspace\envs\py39\lib\site-packages\scrapy\utils\defer.py", line 162, in f
        return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
      File "C:\Users\****\projects\jsontest\jsontest\pipelines.py", line 30, in process_item
        self.store_db(item)
      File "C:\Users\****\projects\jsontest\jsontest\pipelines.py", line 64, in store_db
        self.cur.execute("insert into races(racedate, track, racename, racetype, distancefinal, gruppe, raceclass, classrating, alterteilnehmer, starterzahl, minalter, maxalter, winningtime, going, finalhurdle, omitted, pricemoney1, pricemoney2, pricemoney3, pricemoney4, pricemoney5, pricemoney6, pricemoney7, pricemoney8, racetime) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
    psycopg2.errors.InFailedSqlTransaction: FEHLER:  aktuelle Transaktion wurde abgebrochen, Befehle werden bis zum Ende der Transaktion ignoriert

every further crawled result from here on produces an error message like this (the German FEHLER text is PostgreSQL's "current transaction is aborted, commands ignored until end of transaction block").

This was not the case before: the next transaction (the next result) was processed normally. And if I crawl one of these "error" results with the same spider, but without reading the JSON file (so only this one result), everything works fine.
As I said, this code read 165,000 pages without a single error three days ago.
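For reference, the error itself is easy to reproduce outside Scrapy: once any statement inside a psycopg2 transaction fails, every following statement on that connection raises InFailedSqlTransaction until rollback() is called. A minimal standalone sketch (connection details are placeholders):

    import psycopg2

    # placeholder connection details
    conn = psycopg2.connect(host="localhost", dbname="racing", user="postgres", password="***")
    cur = conn.cursor()

    try:
        cur.execute("select 1/0")  # any failing statement aborts the open transaction
    except psycopg2.Error as exc:
        print("first error:", exc)

    try:
        cur.execute("select 1")  # perfectly valid, but the transaction is already aborted
    except psycopg2.errors.InFailedSqlTransaction as exc:
        print("follow-up error:", exc)  # "current transaction is aborted, ..."

    conn.rollback()  # after a rollback the connection is usable again
    cur.execute("select 1")
    print(cur.fetchone())  # (1,)

So, as far as I understand it, the InFailedSqlTransaction messages are follow-up errors; something must have failed earlier in the same transaction.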

So I guess the problem is not the code per se - what could prompt such problems in psycopg2?

  • Are there any known issues that I missed when googling?
  • How can I narrow this down? (One idea is sketched below.)
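What I am considering adding to store_db to narrow it down: catch the error per item, roll the transaction back so the next item can still be inserted, and log the original exception instead of the follow-up InFailedSqlTransaction noise. A minimal sketch, again with the column list shortened:

    import logging

    import psycopg2

    logger = logging.getLogger(__name__)


    class RacesPipeline:
        # open_spider / process_item / close_spider as in the sketch above

        def store_db(self, item):
            try:
                self.cur.execute(
                    "insert into races(racedate, track, racename) values (%s, %s, %s)",
                    (item["date"][0], item["track"][0], item["racename"][0]),
                )
                self.conn.commit()  # commit per item so one bad row cannot poison the rest
            except psycopg2.Error as exc:
                # roll back so the connection is usable again for the next item,
                # and log the *first* error rather than the later InFailedSqlTransaction ones
                self.conn.rollback()
                logger.error("insert failed for item %r: %s", item, exc)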

Thanks so much!
