For a research project I have created an SQLite database that stores news articles. Currently the database is 272 GB and is stored on a 2 TB cloud volume. My cloud machine has 32 cores and 128 GB of RAM and is attached to this volume.
I am running the following query: "select * from articles where year={} and source in {}"
in which I replace '{}' with a year and about 6 sources.
Running this query takes about 1h and results in about 450k rows being yielded by the DB (out of 90 million total rows). While doing this, CPU usage is virtually at 0%.
The table has been created this way: "create table if not exists articles(source_id TEXT, source TEXT, day INTEGER, month INTEGER, year INTEGER, program_name TEXT, transcript TEXT, parliament INTEGER, top1_topic INTEGER, top1_acc REAL, top2_topic INTEGER, top2_acc REAL, top3_topic INTEGER, top3_acc REAL, emotionality_nrc REAL, emotionality_liwc REAL, subject_codes TEXT, PRIMARY KEY (source_id, day, month, year, program_name));"
and I have indexed source and year separately.
The query plan is:
QUERY PLAN
`--SEARCH articles USING INDEX idx_articles_on_year_source (year=? AND source=?)
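(For reference, the DDL below is a hedged reconstruction of the indexing described above: the question mentions separate indexes on source and year, while the plan names a composite index, so presumably something like the following exists. The actual definitions and the "news.db" path are assumptions.)

import sqlite3

conn = sqlite3.connect("news.db")  # placeholder path for the real database file
# Separate single-column indexes, as described in the question:
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_source ON articles (source)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_year ON articles (year)")
# The composite index named in the query plan:
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_year_source ON articles (year, source)")
conn.commit()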
I ran an ioping test in the directory where the database is stored and got:
--- . (ext4 /dev/vdb) ioping statistics ---
99 requests completed in 31.1 ms, 396 KiB read, 3.18 k iops, 12.4 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 157.4 us / 314.5 us / 477.6 us / 76.8 us
and the following fio test fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
gave this result:
read: IOPS=10.8k, BW=42.3MiB/s (44.4MB/s)
write: IOPS=3619, BW=14.1MiB/s (14.8MB/s)
I also tried things like `PRAGMA synchronous=OFF` and different journal modes such as MEMORY and WAL.
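(A minimal sketch of how those settings might be applied from Python's sqlite3, with "news.db" as a placeholder path. Note that `synchronous` and the journal mode mainly affect write performance and durability, so they are unlikely to change much for a read-only SELECT like the one above.)

import sqlite3

conn = sqlite3.connect("news.db")  # placeholder path
conn.execute("PRAGMA synchronous=OFF")   # skip fsync on writes
conn.execute("PRAGMA journal_mode=WAL")  # also tried: MEMORY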
I am a bit lost on why the database is so slow and what I should do to improve its speed. Have I made a stupid mistake in the setup, or is the infrastructure just not good enough? Should I switch to a data warehouse solution such as Amazon Redshift?
PS: I am connecting to the DB via Python's sqlite3 library and use the following code:
def select_articles_by_year_and_sources(self, year, sources=None):
    cur = self.conn.cursor()
    # Builds the SQL by string formatting rather than parameter binding
    rows = cur.execute(select_articles_by_year_and_sources_query.format(year, sources))
    return iter(ResultIterator(rows))
conn = db.NewsDb(path_db) # connect to database
articles = list(conn.select_articles_by_year_and_sources(year, sources))
conn.close()
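(As an aside, `str.format` interpolates the values straight into the SQL text. A bound-parameter version of the same query is sketched below; it won't change the I/O-bound runtime, but it avoids quoting and injection problems. The standalone function signature and connection handling here are illustrative assumptions:)

def select_articles_by_year_and_sources(conn, year, sources):
    # One "?" placeholder per source, so values are bound by sqlite3
    # instead of being pasted into the SQL string.
    placeholders = ",".join("?" * len(sources))
    query = f"select * from articles where year=? and source in ({placeholders})"
    return conn.execute(query, (year, *sources))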
I just tried copying an 8 GB file from the attached volume to my VM. It took 2 minutes and 30 seconds with the bash cp command, i.e. roughly 8 GB / 150 s ≈ 55 MB/s. I guess that means the bandwidth to the attached volume is quite slow?
1 Answer
Your query plan shows that the index on the two columns in your WHERE clause, `year` and `source`, is being used, so you might not be able to speed this up. It's possible, though, that depending on the distribution of your data, an index on `articles(source, year)` instead of the one on `articles(year, source)` might do better by pruning out more rows faster.

You can try adding that new index and then running `ANALYZE` on the database to generate statistics about the indexes, which SQLite uses to pick whichever of several possible indexes it thinks will work better. Check the `EXPLAIN QUERY PLAN` output afterwards to see whether it's using the new index or still the old one, and then drop whichever index isn't being used (or, if the new index turns out slower in practice, drop that one).

Another option is to use the sqlite3 command-line program's `.expert` command, which generates index suggestions for queries, to see what it comes up with in this case.
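(Concretely, the suggestion above might look like the following from Python's sqlite3. This is a sketch; the index name, the "news.db" path, and the placeholder year/source values are assumptions.)

import sqlite3

conn = sqlite3.connect("news.db")  # placeholder path

# 1. Add the alternative composite index with source first.
conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_on_source_year ON articles (source, year)")

# 2. Refresh the statistics SQLite's planner uses to choose between indexes.
conn.execute("ANALYZE")
conn.commit()

# 3. See which index the planner now picks for the actual query shape.
for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM articles WHERE year = ? AND source IN (?, ?)",
        (2020, "source_a", "source_b")):  # placeholder bindings
    print(row)

If the plan still names idx_articles_on_year_source, the unused index can be dropped with DROP INDEX; if the new one is chosen but turns out slower in practice, drop that one instead.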