Web Crawler - Iterating over Postgres database results with Scrapy
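I'm trying to write a scraper that takes its domains from database results. I can fetch the data from the database, but I can't work out how to feed it to Scrapy. I have searched here and found plenty of suggestions, but none of them match what I'm actually doing. When I run the code below, nothing happens, not even an error.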
scraper.py
#import json
import json
#import database library
import psycopg2
#import scrapy library
import scrapy
#create database connection
conn = psycopg2.connect(
    host="localhost",
    database="mydb",
    user="dbuser",
    password="postgres",
    port=5432
)
#create cursor from database
#cursor() is python equivalent to query() to fetch the rows
query = conn.cursor()
#execute query from database
query.execute('SELECT info FROM domains')
#create scrapy class
class MySpider(scrapy.Spider):
    name = "scrap_domains"
    #start_requests with scrapy
    def start_requests(self):
        #iterate over database result
        for url in query:
            #iterate over each json object
            for item in url:
                #get domain name
                domain_name = item['domain']
                #grab information from url
                yield scrapy.Request()
    #print response
    def parse(self, response):
        print(response)
# we close the cursor and conn both
query.close()
conn.close()
Comments (1)
I finally got my scraper working. The problem was that the cursor and the database connection were being closed on every iteration. As I've been learning, Python is not asynchronous in the way Node is. Ideally a function would detect when the iteration has finished and only then carry on with further tasks, but for the purposes of this example we simply comment the close calls out, as at the bottom of the file. I'm posting a detailed answer for future reference.
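One way to read that diagnosis: `start_requests` is a generator, so its body only runs once Scrapy starts pulling requests from it, which is after any module-level `close()` calls have already executed. The sketch below reproduces that ordering with the standard-library `sqlite3` module standing in for psycopg2 (an assumption, made so the example runs without a Postgres server; the table contents and function names are hypothetical). It also shows a `try`/`finally` that releases the cursor and connection only when iteration has finished.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with a hypothetical `domains` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE domains (info TEXT)")
    conn.executemany(
        "INSERT INTO domains (info) VALUES (?)",
        [("example.com",), ("example.org",)],
    )
    return conn


def iter_domains(cursor):
    """Generator: nothing in this body runs until the caller starts iterating."""
    for (info,) in cursor:
        yield info


def iter_domains_then_close(conn):
    """Generator that owns its resources and releases them when iteration ends."""
    cursor = conn.cursor()
    cursor.execute("SELECT info FROM domains ORDER BY info")
    try:
        for (info,) in cursor:
            yield info
    finally:
        # Runs after the last row is consumed (or the generator is closed).
        cursor.close()
        conn.close()


# Broken ordering: the cursor is closed before the generator is consumed.
conn = open_demo_db()
cursor = conn.cursor()
cursor.execute("SELECT info FROM domains ORDER BY info")
pending = iter_domains(cursor)  # no rows have been read yet
cursor.close()
conn.close()
try:
    early_close_result = list(pending)
except sqlite3.ProgrammingError as exc:
    early_close_result = f"failed: {exc}"

# Working ordering: cleanup happens inside the generator, after the last row.
late_close_result = list(iter_domains_then_close(open_demo_db()))
```

With the cleanup inside the generator, the caller does not need to know when the last row has been read.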
I'm using PostgreSQL and store the data in a JSONB column. My table has only two columns and looks like this:
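As a rough illustration of the row shape implied by the question's code: `SELECT info FROM domains` returns one-column tuples, and psycopg2 decodes a JSONB value into a Python object, assumed here to be a dict with a `domain` key, as in `item['domain']`. The helper name `urls_from_rows`, the sample rows, and the `https://` prefix are hypothetical.

```python
def urls_from_rows(rows, scheme="https"):
    """Yield a crawlable URL for each (info,) row.

    Assumes `info` is the already-decoded JSONB value: a dict holding a
    bare host name under the "domain" key.
    """
    for (info,) in rows:
        domain = info.get("domain")
        if not domain:
            continue  # skip rows without a usable domain
        yield f"{scheme}://{domain.strip()}"


# Rows shaped like cursor.fetchall() for: SELECT info FROM domains
sample_rows = [
    ({"domain": "example.com"},),
    ({"domain": " example.org "},),
    ({"note": "no domain here"},),
]
sample_urls = list(urls_from_rows(sample_rows))
```

Keeping this step separate from the spider makes it possible to check the URL building without a database or a network connection.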
As per the Scrapy documentation, copy and paste the code below, then run this command in your terminal to write all domains to a JSON file:
Use Selectors to extract HTML data from the body:
scraper.py
domains.json (example output)