Approaching serial text-file read performance in MySQL

Posted 2024-10-06 16:56:14


I am trying to perform some n-gram counting in Python, and I thought I could use MySQL (the MySQLdb module) to organize my text data.

I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).

select * from table is too slow and devastates memory. I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one, with queries like:

select * from table where id >= 1 and id <= 1999
select * from table where id >= 2000 and id <= 2999

and so on...

Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?

I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.
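
For reference, here is a minimal sketch of the chunked approach in Python with MySQLdb; the table name documents, the columns id, lang, and doc, and the connection parameters are all assumptions, and process() is a placeholder for the actual n-gram counting:

import MySQLdb

CHUNK = 2000  # records per range, as in the queries above

def process(row):
    pass  # placeholder for the n-gram counting step

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

cur.execute("SELECT MAX(id) FROM documents")
max_id = cur.fetchone()[0] or 0

for start in range(1, max_id + 1, CHUNK):
    # BETWEEN is inclusive on both ends, matching the range queries above
    cur.execute(
        "SELECT id, lang, doc FROM documents WHERE id BETWEEN %s AND %s",
        (start, start + CHUNK - 1),
    )
    for row in cur.fetchall():
        process(row)

cur.close()
conn.close()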


Comments (3)

神魇的王 2024-10-13 16:56:14


You can use the HANDLER statement to traverse a table (or index) in chunks. It is not very portable, and it interacts with transactions in an "interesting" way if rows appear and disappear while you're reading (hint: you're not going to get consistency), but it makes the code simpler for some applications.

In general, you are going to take a performance hit: even if your database server is local to the machine, several copies of the data will be needed (in memory), along with some other processing. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.
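
A rough sketch of what HANDLER-based traversal could look like from Python, reusing the hypothetical documents table from the question; the statements are MySQL-specific and, as noted above, give no consistency guarantees:

import MySQLdb

def process(row):
    pass  # placeholder for per-document work

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

# Walk the primary key in 2000-row chunks without re-planning a query each time.
cur.execute("HANDLER documents OPEN")
cur.execute("HANDLER documents READ `PRIMARY` FIRST LIMIT 2000")
rows = cur.fetchall()
while rows:
    for row in rows:
        process(row)
    cur.execute("HANDLER documents READ `PRIMARY` NEXT LIMIT 2000")
    rows = cur.fetchall()
cur.execute("HANDLER documents CLOSE")
conn.close()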

少女七分熟 2024-10-13 16:56:14


Aside from having indexes defined on whatever columns you're using to filter the query (probably language and ID, where ID already has an index by virtue of being the primary key), no.
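
As a hedged illustration of that advice, again assuming the hypothetical documents table from the question: a composite index on (lang, id) lets MySQL serve the per-language range scans directly from the index:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="corpus")
cur = conn.cursor()

# One-time setup; this fails if an index with the same name already exists.
cur.execute("ALTER TABLE documents ADD INDEX idx_lang_id (lang, id)")

# The range queries from the question, now restricted to one language.
cur.execute(
    "SELECT id, doc FROM documents WHERE lang = %s AND id BETWEEN %s AND %s",
    ("en", 1, 1999),
)
for doc_id, doc in cur.fetchall():
    print(doc_id)
conn.close()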

世界如花海般美丽 2024-10-13 16:56:14


First: you should avoid using * if you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all this in a database, especially if you are storing file names. You could use an XML format, for example (and read/write it with a SAX API).

If you want a DB and something faster than MySQL, you can consider an in-memory database such as SQLite or BerkeleyDB, both of which have Python bindings.

Greetz,
J.
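
A minimal sketch of the SQLite alternative mentioned above, using Python's standard-library sqlite3 module; the schema is an assumption, and ":memory:" only makes sense if the corpus fits in RAM:

import sqlite3

conn = sqlite3.connect("corpus.db")  # or ":memory:" for a purely in-memory DB
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, lang TEXT, doc TEXT)"
)

# sqlite3 cursors iterate lazily, so this streams rows much like reading a file.
for doc_id, doc in cur.execute("SELECT id, doc FROM documents WHERE lang = ?", ("en",)):
    pass  # placeholder for the n-gram counting step

conn.close()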
