Join with Python's sqlite3 module is slower than doing it manually
I am using Python's built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries; the result again contains about 150000 entries. If I execute the query in SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
from collections import defaultdict

cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
                                "FROM Proteins, Annotations " +
                                "WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
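For anyone who wants to reproduce the measurement, here is a self-contained sketch of the setup; only the table and column names come from the question, while the in-memory database, row counts, and ID formats are placeholders I made up:

```python
import sqlite3
import time
from collections import defaultdict

# Small in-memory stand-in for the real database (sizes are made up).
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE Proteins (Id INTEGER PRIMARY KEY, PrimaryId TEXT);
    CREATE TABLE Annotations (ProteinId INTEGER, GOId TEXT);
""")
connection.executemany("INSERT INTO Proteins VALUES (?, ?)",
                       [(i, "P%d" % i) for i in range(1000)])
connection.executemany("INSERT INTO Annotations VALUES (?, ?)",
                       [(i % 1000, "GO:%07d" % i) for i in range(4000)])

# Time the join plus the full fetch, as in the question.
cursor = connection.cursor()
start = time.time()
rows = cursor.execute("SELECT PrimaryId, GOId " +
                      "FROM Proteins, Annotations " +
                      "WHERE Proteins.Id = Annotations.ProteinId").fetchall()
elapsed = time.time() - start

annotations = defaultdict(list)
for protein, goterm in rows:
    annotations[protein].append(goterm)

print("%d rows joined in %.3f s" % (len(rows), elapsed))
```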
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)  # map internal Id -> PrimaryId
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
3 Answers
There is a discussion about it here: http://www.mail-archive.com/[email protected]/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is some advice on how to make your queries faster:
You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this:
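The code sample appears to have been lost from this answer; the following is a sketch of what the suggested index statements would look like. The table and column names come from the question, while the index names and the in-memory database are my own placeholders:

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # placeholder for the real database
connection.executescript("""
    CREATE TABLE IF NOT EXISTS Proteins (Id INTEGER, PrimaryId TEXT);
    CREATE TABLE IF NOT EXISTS Annotations (ProteinId INTEGER, GOId TEXT);
""")

# Index both sides of the join so SQLite can look up matching rows
# instead of scanning the whole table for every candidate.
connection.execute("CREATE INDEX IF NOT EXISTS Proteins_Id ON Proteins(Id)")
connection.execute(
    "CREATE INDEX IF NOT EXISTS Annotations_ProteinId ON Annotations(ProteinId)")
connection.commit()
```

Note that if `Proteins.Id` is declared `INTEGER PRIMARY KEY`, it is already the rowid and needs no extra index; the one on `Annotations.ProteinId` is the one most likely missing.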
I wanted to update this because I am noticing the same issue, and it is now 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100000 rows * >200 columns). In particular, I have noticed that my 3-table inner join clocks in at around 12 minutes of run time in Python, whereas running the same join query in sqlite3 from the CLI takes about 100 seconds. All the join predicates are properly indexed, and EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy had me pulling my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due, so this is my fix.
Make sure you close any open connections before running this so you don't end up locked out, and you will have to re-instantiate any connection objects afterward if you are going back to working with sqlite in Python.
Hope this helps and if anyone has figured the source of this out, please link to it!
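A sketch of that workaround, driving the sqlite3 command-line shell from Python instead of fetching rows through the sqlite3 module; the database file, table names, and output file here are placeholders, not from the answer, and it assumes the `sqlite3` CLI is on your PATH:

```python
import subprocess

# Hand the heavy join to the sqlite3 shell and dump the result to CSV,
# rather than iterating over the rows inside Python.
script = """
CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE b (a_id INTEGER, val TEXT);
INSERT INTO a VALUES (1, 'x'), (2, 'y');
INSERT INTO b VALUES (1, 'p'), (1, 'q'), (2, 'r');
.mode csv
.output joined.csv
SELECT * FROM a JOIN b ON a.id = b.a_id;
"""
subprocess.run(["sqlite3", "demo.db"], input=script, text=True, check=True)
```

The `.mode` and `.output` dot-commands are features of the CLI shell, so this route only works outside the Python module; the CSV can then be read back in afterward.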