For a research project I have created an SQLite database that stores news articles. Currently the database is 272 GB and is stored on a 2 TB cloud volume. My cloud machine has 32 cores and 128 GB of RAM and is attached to this volume.
I am running the following query: "select * from articles where year={} and source in {}"
in which I replace '{}' with a year and about 6 sources.
Running this query takes about 1h and results in about 450k rows being yielded by the DB (out of 90 million total rows). While doing this, CPU usage is virtually at 0%.
The table has been created this way: "create table if not exists articles(source_id TEXT, source TEXT, day INTEGER, month INTEGER, year INTEGER, program_name TEXT, transcript TEXT, parliament INTEGER, top1_topic INTEGER, top1_acc REAL, top2_topic INTEGER, top2_acc REAL, top3_topic INTEGER, top3_acc REAL, emotionality_nrc REAL, emotionality_liwc REAL, subject_codes TEXT, PRIMARY KEY (source_id, day, month, year, program_name));"
and I have indexed source and year separately.
The query plan is:
QUERY PLAN
`--SEARCH articles USING INDEX idx_articles_on_year_source (year=? AND source=?)
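(For reference, the DDL below is a hedged reconstruction of the indexing described above: the question mentions separate indexes on source and year, while the plan names a composite index, so presumably something like the following exists. The actual definitions and the "news.db" path are assumptions.)

import sqlite3

conn = sqlite3.connect("news.db")  # placeholder path for the real database file
# Separate single-column indexes, as described in the question:
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_source ON articles (source)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_year ON articles (year)")
# The composite index named in the query plan:
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_year_source ON articles (year, source)")
conn.commit()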
I ran an ioping test in the directory where the database is stored and got:
--- . (ext4 /dev/vdb) ioping statistics ---
99 requests completed in 31.1 ms, 396 KiB read, 3.18 k iops, 12.4 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 157.4 us / 314.5 us / 477.6 us / 76.8 us
and the following fio test fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
gave this result:
read: IOPS=10.8k, BW=42.3MiB/s (44.4MB/s)
write: IOPS=3619, BW=14.1MiB/s (14.8MB/s)
I also tried things like `PRAGMA synchronous=OFF` and different journal modes such as MEMORY and WAL.
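(A minimal sketch of how those settings might be applied from Python's sqlite3, with "news.db" as a placeholder path. Note that `synchronous` and the journal mode mainly affect write performance and durability, so they are unlikely to change much for a read-only SELECT like the one above.)

import sqlite3

conn = sqlite3.connect("news.db")  # placeholder path
conn.execute("PRAGMA synchronous=OFF")   # skip fsync on writes
conn.execute("PRAGMA journal_mode=WAL")  # also tried: MEMORY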
I am a bit lost on why the database is so slow and what I should do to improve its speed. Have I made a stupid mistake in the setup, or is the infrastructure just not good enough? Should I switch to a data warehouse solution such as Amazon Redshift?
PS: I am connecting to the DB via Python's sqlite3 library and use the following code:
def select_articles_by_year_and_sources(self, year, sources=None):
    cur = self.conn.cursor()
    # Builds the SQL by string formatting rather than parameter binding
    rows = cur.execute(select_articles_by_year_and_sources_query.format(year, sources))
    return iter(ResultIterator(rows))
conn = db.NewsDb(path_db) # connect to database
articles = list(conn.select_articles_by_year_and_sources(year, sources))
conn.close()
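(As an aside, `str.format` interpolates the values straight into the SQL text. A bound-parameter version of the same query is sketched below; it won't change the I/O-bound runtime, but it avoids quoting and injection problems. The standalone function signature and connection handling here are illustrative assumptions:)

def select_articles_by_year_and_sources(conn, year, sources):
    # One "?" placeholder per source, so values are bound by sqlite3
    # instead of being pasted into the SQL string.
    placeholders = ",".join("?" * len(sources))
    query = f"select * from articles where year=? and source in ({placeholders})"
    return conn.execute(query, (year, *sources))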
I just tried copying an 8 GB file from the attached volume to my VM. It took 2 minutes and 30 seconds with the bash cp command, i.e. roughly 8 GB / 150 s ≈ 55 MB/s. I guess that means the bandwidth to the attached volume is quite slow?
1 Answer
Your query plan shows that the index on the two columns in your WHERE clause, `year` and `source`, is being used, so you might not be able to speed this up. It's possible, though, that depending on the distribution of your data, an index on `articles(source, year)` instead of the one on `articles(year, source)` might do better by pruning out more rows faster.

You can try adding that new index and then running `ANALYZE` on the database to generate statistics about the indexes, which SQLite uses to pick whichever of several possible indexes it thinks will work better. Check the `EXPLAIN QUERY PLAN` output afterwards to see whether it's using the new index or still the old one, and then drop whichever index isn't being used (or, if the new index turns out slower in practice, drop that one).

Another option is to use the sqlite3 command-line program's `.expert` command, which generates index suggestions for queries, to see what it comes up with in this case.
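(Concretely, the suggestion above might look like the following from Python's sqlite3. This is a sketch; the index name, the "news.db" path, and the placeholder year/source values are assumptions.)

import sqlite3

conn = sqlite3.connect("news.db")  # placeholder path

# 1. Add the alternative composite index with source first.
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_source_year ON articles (source, year)")

# 2. Refresh the statistics SQLite's planner uses to choose between indexes.
conn.execute("ANALYZE")
conn.commit()

# 3. See which index the planner now picks for the actual query shape.
for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM articles WHERE year = ? AND source IN (?, ?)",
        (2020, "source_a", "source_b")):  # placeholder bindings
    print(row)

If the plan still names idx_articles_on_year_source, the unused index can be dropped with DROP INDEX; if the new one is chosen but turns out slower in practice, drop that one instead.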