SQLite database is very slow on a very simple query. How can I improve performance?

Posted on 2025-02-12 15:34:10


For a research project I have created an SQLite database that stores news articles. The database is currently 272 GB and lives on a 2 TB cloud volume. My cloud machine has 32 cores and 128 GB of RAM and is attached to this volume.

I am running the following query:
"select * from articles where year={} and source in {}"
in which I replace {} with a year and a list of about six sources.
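For illustration, with hypothetical values the substituted query looks like this (the year and source names are made up):

    select * from articles where year=2008 and source in ('CNN', 'BBC', 'Reuters')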

Running this query takes about 1 hour and yields roughly 450k rows (out of 90 million rows in total). While it runs, CPU usage sits at virtually 0%.

The table has been created this way:
"create table if not exists articles(source_id TEXT, source TEXT, day INTEGER, month INTEGER, year INTEGER, program_name TEXT, transcript TEXT, parliament INTEGER, top1_topic INTEGER, top1_acc REAL, top2_topic INTEGER, top2_acc REAL, top3_topic INTEGER, top3_acc REAL, emotionality_nrc REAL, emotionality_liwc REAL, subject_codes TEXT, PRIMARY KEY (source_id, day, month, year, program_name));"
and I have indexed source and year separately.
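For reference, those indexes would have been created along these lines. This is a minimal sketch through Python's sqlite3; the database path is hypothetical, the single-column index names are assumptions, and the composite index is inferred from the query plan shown below:

    import sqlite3

    conn = sqlite3.connect("news.db")  # hypothetical path to the 272 GB database
    conn.executescript("""
        CREATE INDEX IF NOT EXISTS idx_articles_on_year   ON articles(year);
        CREATE INDEX IF NOT EXISTS idx_articles_on_source ON articles(source);
        -- the query plan below also names a composite index, presumably:
        CREATE INDEX IF NOT EXISTS idx_articles_on_year_source ON articles(year, source);
    """)
    conn.close()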

The query explanation is:

QUERY PLAN
`--SEARCH articles USING INDEX idx_articles_on_year_source (year=? AND source=?)

I ran an ioping test in the directory where the database is stored and got:

--- . (ext4 /dev/vdb) ioping statistics ---
99 requests completed in 31.1 ms, 396 KiB read, 3.18 k iops, 12.4 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 157.4 us / 314.5 us / 477.6 us / 76.8 us

and the following fio test:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

gave this result:

read: IOPS=10.8k, BW=42.3MiB/s (44.4MB/s)
write: IOPS=3619, BW=14.1MiB/s (14.8MB/s)

I also tried settings like PRAGMA synchronous=OFF and different journal modes such as MEMORY and WAL.
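Those settings were applied per connection, roughly like this (a minimal sketch; the database path is hypothetical):

    import sqlite3

    conn = sqlite3.connect("news.db")          # hypothetical path
    conn.execute("PRAGMA synchronous=OFF")     # skip fsyncs; faster, but unsafe on crash
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    print(mode)                                # prints 'wal' if the switch took effect
    conn.close()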

I am a bit lost as to why the database is so slow and what I should do to improve its speed. Have I made a stupid mistake in the setup, or is the infrastructure just not good enough? Should I switch to a data warehouse solution such as Amazon Redshift?

PS: I am connecting to the DB via Python's sqlite3 library, using the following code:

 def select_articles_by_year_and_sources(self, year, sources=None):
        cur = self.conn.cursor()
        rows = cur.execute(select_articles_by_year_and_sources_query.format(year, sources))

        return iter(ResultIterator(rows))

conn = db.NewsDb(path_db) # connect to database
articles = list(conn.select_articles_by_year_and_sources(year, sources))
conn.close()
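For comparison, the same lookup with bound parameters instead of str.format would look roughly like this. This is a minimal sketch; the path, year, and source names are hypothetical:

    import sqlite3

    def select_articles_by_year_and_sources(conn, year, sources):
        placeholders = ",".join("?" * len(sources))  # one '?' per source
        query = "select * from articles where year=? and source in ({})".format(placeholders)
        return conn.execute(query, (year, *sources))  # sqlite3 binds the values itself

    conn = sqlite3.connect("news.db")  # hypothetical path
    for row in select_articles_by_year_and_sources(conn, 2008, ["CNN", "BBC"]):
        pass  # stream rows instead of building a 450k-row list in memory
    conn.close()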

I just tried copying an 8 GB file from the attached volume to my VM. The bash cp command took 2 min 30 s, which works out to roughly 55 MB/s. I guess that means the bandwidth to the attached volume is quite slow?


Comments (1)

简单爱 2025-02-19 15:34:10


Your query plan shows that the index on the two columns in your WHERE clause (year and source) is being used, so you might not be able to speed this up much. It's possible, though, that depending on the distribution of your data, an index on articles(source, year) rather than articles(year, source) might do better by pruning more rows earlier.

You can try adding that new index and then running ANALYZE on the database, which generates the statistics SQLite uses to pick whichever of several candidate indexes it thinks will work better. Afterwards, check the EXPLAIN QUERY PLAN output to see whether it uses the new index or still the old one, and drop the index that isn't being used (or, if the new index turns out to be slower in practice, drop that one instead).
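A minimal sketch of that workflow through Python's sqlite3 (the index name, path, year, and sources are hypothetical):

    import sqlite3

    conn = sqlite3.connect("news.db")  # hypothetical path
    conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_source_year "
                 "ON articles(source, year)")
    conn.execute("ANALYZE")  # gathers the statistics the planner consults
    plan = conn.execute(
        "EXPLAIN QUERY PLAN "
        "select * from articles where year=? and source in (?,?)",
        (2008, "CNN", "BBC"),
    ).fetchall()
    print(plan)  # check which index the SEARCH step names now
    conn.close()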

Another option is to use the sqlite3 command-line shell's .expert command, which generates index suggestions for queries, and see what it comes up with in this case.
