Joining with Python's sqlite3 module is slower than a manual join

Posted 2024-09-06 23:56:31


I am using Python's built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, and the result again contains about 150000 entries. If I execute the query in SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:

cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " + 
                                "FROM Proteins, Annotations " +
                                "WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)

I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.

I implemented the join manually, and this works much faster. The code looks like this:

cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins ").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)

So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?

I tried the same with apsw, and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
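For anyone who wants to reproduce this, here is a minimal self-contained sketch (not the original benchmark; table and column names follow the question, sizes are scaled down, data is synthetic) that times the SQL join against the manual Python join on an in-memory database:

```python
import sqlite3
import time
from collections import defaultdict

# Build a small in-memory database shaped like the one in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Proteins (Id INTEGER PRIMARY KEY, PrimaryId TEXT)")
conn.execute("CREATE TABLE Annotations (ProteinId INTEGER, GOId TEXT)")
conn.executemany("INSERT INTO Proteins VALUES (?, ?)",
                 ((i, "P%06d" % i) for i in range(15000)))
conn.executemany("INSERT INTO Annotations VALUES (?, ?)",
                 ((i % 15000, "GO:%07d" % i) for i in range(15000)))
conn.commit()
cursor = conn.cursor()

# Approach 1: let SQLite do the join.
t0 = time.time()
sql_result = defaultdict(list)
for protein, goterm in cursor.execute(
        "SELECT PrimaryId, GOId FROM Proteins, Annotations "
        "WHERE Proteins.Id = Annotations.ProteinId"):
    sql_result[protein].append(goterm)
t_sql = time.time() - t0

# Approach 2: fetch both tables and join in Python via a dict lookup.
t0 = time.time()
proteins = dict(cursor.execute("SELECT Id, PrimaryId FROM Proteins").fetchall())
py_result = defaultdict(list)
for protein, goterm in cursor.execute(
        "SELECT ProteinId, GOId FROM Annotations").fetchall():
    py_result[proteins[protein]].append(goterm)
t_py = time.time() - t0

print("SQL join:    %.3fs" % t_sql)
print("Python join: %.3fs" % t_py)
assert sql_result == py_result  # both approaches must produce the same mapping
```

The assertion at the end is just a sanity check that the two approaches agree; the interesting part is how the two timings compare on your machine.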


Comments (3)

他不在意 2024-09-13 23:56:31


There is a discussion about it here: http://www.mail-archive.com/[email protected]/msg253067.html

It seems that there is a performance bottleneck in the sqlite3 module. There is some advice on how to make your queries faster:

  • make sure that you do have indices on the join columns
  • use pysqlite

很酷不放纵 2024-09-13 23:56:31


You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).

Create the SQLite indexes like this:

CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id);
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId);
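A sketch of applying these statements from Python and then checking with EXPLAIN QUERY PLAN that the planner can actually use one of the indexes (the schema is guessed from the question; only the join columns matter here):

```python
import sqlite3

# Minimal schema guessed from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Proteins (Id INTEGER, PrimaryId TEXT)")
conn.execute("CREATE TABLE Annotations (ProteinId INTEGER, GOId TEXT)")

# The indexes suggested above.
conn.execute("CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)")
conn.execute("CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId "
             "ON Annotations (ProteinId)")

# EXPLAIN QUERY PLAN reports whether the join can use an index:
# one table is scanned, the other should be searched via its index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT PrimaryId, GOId FROM Proteins, Annotations "
    "WHERE Proteins.Id = Annotations.ProteinId").fetchall()
for row in plan:
    print(row)
```

If neither index existed, the plan would show a full scan of both tables, which is exactly the quadratic behavior that makes a large join crawl.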

メ斷腸人バ 2024-09-13 23:56:31


I wanted to update this because I am noticing the same issue, and it is now 2022...

In my own application I am using Python 3 and sqlite3 to do some data wrangling on large databases (>100000 rows × >200 columns). In particular, I noticed that my 3-table inner join clocks in at around 12 minutes of run time in Python, whereas running the same join query in the sqlite3 CLI takes about 100 seconds. All the join predicates are properly indexed, and EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.

The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due so this is my fix.

  1. Write the query out to an .sql file (I am using f-strings to pass variables in, so the example uses a {foo} placeholder here)

fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()

  2. Run os.system from inside Python and send the .sql file to sqlite3

os.system(f"sqlite3 {database} < filename.sql")

Make sure you close any open connections before running this so you don't end up locked out, and you'll have to re-instantiate any connection objects afterward if you're going back to working with sqlite inside Python.

Hope this helps and if anyone has figured the source of this out, please link to it!
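A possible variant of the workaround above, using subprocess instead of os.system so that the shell redirection is avoided and CLI errors are surfaced as exceptions (the database path and query below are placeholders, not from the original post):

```python
import shutil
import subprocess
import tempfile

# Placeholder names standing in for the real database and join query.
database = "example.db"
query = "CREATE TABLE IF NOT EXISTS joined AS SELECT 1 AS a, 2 AS b;"

# 1. Write the query out to an .sql script file.
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as fi:
    fi.write(query)
    script = fi.name

# 2. Feed the script to the sqlite3 CLI on stdin (instead of shell "<").
#    check=True raises CalledProcessError if the CLI reports an error.
if shutil.which("sqlite3"):  # skip gracefully if the CLI is not installed
    with open(script) as f:
        subprocess.run(["sqlite3", database], stdin=f, check=True)
```

Passing the file on stdin does the same thing as the `< filename.sql` redirection, but without involving a shell, so filenames with spaces and failing exit codes are handled cleanly.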
