Record iteration stops after the first set of INSERTs when the SQL INSERT is active, but testing with PRINT iterates through every record (as desired)
Python 3.10.2 with sqlite3 on Windows 10. Relatively new to Python, but I was quite experienced with Perl around 1997-2005.
Banged my head against this for three days. Attacked from lots of angles. No success. I'm asking for guidance, because I don't see myself progressing on my own at this point without help flipping on the light-switch.
My overall application uses PRAW to scrape a subreddit and put the Submissions into a table. Then, we go through that reddit table one row at a time, scanning the selftext column for any URLs. Disregard some. Clean up others. Insert the ones we want into another table. At a later point, I'll then go through that table, downloading each URL.
My problem is that if I run the code below with the INSERT commented out, my print(dl_lnk[0]) line prints the results for all (currently) 1,400 rows from the reddit table. But if I activate the INSERT line, it only seems to process the first row in the table. I assume this is the case because the print line only shows a few lines, and they all involve the same user and the same/similar URL.
I don't understand why this is. I don't think it's because of an error in the SQL (though there seem to be inherently fewer options for accessing SQL exceptions in-code than I used to have in Perl). But I also don't see anything in my flow logic that would make it process just one row when the SQL INSERT happens, yet process all of them when it's commented out.
def scanForDownloadLinks(db):
    cursorObj = db.cursor()
    # Get all records from 'reddit' submission table.
    for reddit_row in cursorObj.execute('''SELECT name, author, subreddit, permalink, title, selftext FROM reddit ORDER BY CREATED_UTC ASC'''):
        # Build URL-matching regex
        link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
        # Find all URLs in this reddit submission's selftext column.
        dl_links = re.findall(link_regex, reddit_row[5])
        # Go through each link, disregard links we don't want, clean up links we do, then insert them into the dl_link table
        for dl_lnk in dl_links:
            print(dl_lnk[0])
            cursorObj.execute('''INSERT INTO dl_ledger (name, title, dl_link) VALUES(?,?,?)''', (reddit_row[0], str(reddit_row[4]), dl_lnk[0]))
    db.commit()
You should use different cursors for the SELECT and INSERT queries. When you use the same cursor, the INSERT resets the cursor, so you can't fetch the remaining rows.
Or you could use cursorObj.fetchall() to get all the results of the SELECT query as a list and loop through that, rather than looping through the cursor itself. But if there are lots of rows, this will use lots of memory, while looping through the cursor is incremental. But 1,400 rows may not be a problem.
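Below is a minimal sketch of the two-cursor suggestion, reusing the table names, column names, and regex from the question; it illustrates the idea and has not been tested against the asker's database.

import re
import sqlite3

def scanForDownloadLinks(db):
    select_cur = db.cursor()   # used only to iterate the SELECT results
    insert_cur = db.cursor()   # used only for the INSERTs
    # Compile the URL-matching regex once, outside the loop (pattern copied from the question).
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    # Iterate the submissions on one cursor...
    for reddit_row in select_cur.execute('''SELECT name, author, subreddit, permalink, title, selftext FROM reddit ORDER BY CREATED_UTC ASC'''):
        dl_links = re.findall(link_regex, reddit_row[5])
        for dl_lnk in dl_links:
            print(dl_lnk[0])
            # ...and insert on the other, so the pending SELECT is never reset.
            insert_cur.execute('''INSERT INTO dl_ledger (name, title, dl_link) VALUES(?,?,?)''',
                               (reddit_row[0], str(reddit_row[4]), dl_lnk[0]))
    db.commit()

The fetchall() variant keeps a single cursor but materializes the whole result set up front, so later execute() calls on that cursor can no longer invalidate the iteration:

rows = cursorObj.execute('''SELECT name, author, subreddit, permalink, title, selftext FROM reddit ORDER BY CREATED_UTC ASC''').fetchall()
for reddit_row in rows:
    ...  # same link-scanning and INSERT logic as above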