Approaching serial text-file read performance in MySQL

Posted 2024-10-06 16:56:14


I am trying to perform some n-gram counting in Python, and I thought I could use MySQL (the MySQLdb module) to organize my text data.

I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).

select * from table is too slow and devastates memory. I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one, with queries like:

select * from table where id >= 1 and id <= 1999
select * from table where id >= 2000 and id <= 2999

and so on...

Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?

I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.
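
For reference, here is a minimal sketch of the chunked approach in Python with MySQLdb; the table name documents, the columns id, lang, and doc, and the connection parameters are all assumptions, and process() is a placeholder for the actual n-gram counting:

import MySQLdb

CHUNK = 2000  # records per range, as in the queries above

def process(row):
    pass  # placeholder for the n-gram counting step

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

cur.execute("SELECT MAX(id) FROM documents")
max_id = cur.fetchone()[0] or 0

for start in range(1, max_id + 1, CHUNK):
    # BETWEEN is inclusive on both ends, matching the range queries above
    cur.execute(
        "SELECT id, lang, doc FROM documents WHERE id BETWEEN %s AND %s",
        (start, start + CHUNK - 1),
    )
    for row in cur.fetchall():
        process(row)

cur.close()
conn.close()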


Comments (3)

神魇的王 2024-10-13 16:56:14


You can use the HANDLER statement to traverse a table (or index) in chunks. It is not very portable, and it interacts with transactions in an "interesting" way if rows appear and disappear while you're reading (hint: you're not going to get consistency), but it makes the code simpler for some applications.

In general, you are going to take a performance hit: even if your database server is local to the machine, several copies of the data will be needed (in memory), along with some other processing. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.
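
A rough sketch of what HANDLER-based traversal could look like from Python, reusing the hypothetical documents table from the question; the statements are MySQL-specific and, as noted above, give no consistency guarantees:

import MySQLdb

def process(row):
    pass  # placeholder for per-document work

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

# Walk the primary key in 2000-row chunks without re-planning a query each time.
cur.execute("HANDLER documents OPEN")
cur.execute("HANDLER documents READ `PRIMARY` FIRST LIMIT 2000")
rows = cur.fetchall()
while rows:
    for row in rows:
        process(row)
    cur.execute("HANDLER documents READ `PRIMARY` NEXT LIMIT 2000")
    rows = cur.fetchall()
cur.execute("HANDLER documents CLOSE")
conn.close()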

少女七分熟 2024-10-13 16:56:14


Aside from having indexes defined on whatever columns you're using to filter the query (probably language and ID, where ID already has an index by virtue of being the primary key), no.
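
As a hedged illustration of that advice, again assuming the hypothetical documents table from the question: a composite index on (lang, id) lets MySQL serve the per-language range scans directly from the index:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

# One-time setup; this fails if an index with the same name already exists.
cur.execute("ALTER TABLE documents ADD INDEX idx_lang_id (lang, id)")

# The range queries from the question, now restricted to one language.
cur.execute(
    "SELECT id, doc FROM documents WHERE lang = %s AND id BETWEEN %s AND %s",
    ("en", 1, 1999),
)
for doc_id, doc in cur.fetchall():
    print(doc_id)
conn.close()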

世界如花海般美丽 2024-10-13 16:56:14


First: you should avoid using * if you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all this in a database, especially if you are storing file names. You could use an XML format, for example (and read/write it with a SAX API).

If you want a DB and something faster than MySQL, you can consider an in-memory database such as SQLite or BerkeleyDB, both of which have Python bindings.

Greetz,
J.
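
A minimal sketch of the SQLite alternative mentioned above, using Python's standard-library sqlite3 module; the schema is an assumption, and ":memory:" only makes sense if the corpus fits in RAM:

import sqlite3

conn = sqlite3.connect("corpus.db")  # or ":memory:" for a purely in-memory DB
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, lang TEXT, doc TEXT)"
)

# sqlite3 cursors iterate lazily, so this streams rows much like reading a file.
for doc_id, doc in cur.execute("SELECT id, doc FROM documents WHERE lang = ?", ("en",)):
    pass  # placeholder for the n-gram counting step

conn.close()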
