Word list for many articles - document-term matrix

Posted on 2024-08-15 18:37:28

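I have nearly 150,000 Turkish articles that I will use for natural language processing research. After processing the articles, I want to store the words of each article together with their frequencies.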

I currently store them in an RDBMS.

I have 3 tables:

articles -> article_id, text
words -> word_id, type, word
words_articles -> id, word_id, article_id, frequency (index on word_id, index on article_id)

I will run queries against these tables.

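The words_articles table holds millions of rows. I have used an RDBMS throughout this project: I started with MySQL and am now on Oracle. I do not want to keep using Oracle, though, and I want better performance than MySQL gave me.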

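Also, I have to do this work on a machine with 4 GB of RAM.
In short, how should I store a document-term matrix and query it? Performance is essential. Can a key-value database beat MySQL on performance? If not, what can?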

If your answer depends on the programming language, I am writing the code in Python, but C/C++ or Java would also be fine.

淡看悲欢离合 2024-08-22 18:37:28

Maybe take a look at Lucene (or Zend_Search_Lucene in PHP). It is a very good full-text search (FTS) engine.
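To make the suggestion more concrete: a full-text engine such as Lucene is built around an inverted index, which maps each term to the documents that contain it. The following is only a minimal pure-Python sketch of that idea, not Lucene's API. The function name `build_inverted_index`, the sample sentences, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(articles):
    """Map each word to a dict of {article_id: frequency}.

    `articles` is assumed to be a dict of {article_id: text}.
    """
    index = defaultdict(dict)
    for article_id, text in articles.items():
        # Naive tokenization: lowercase and split on whitespace.
        # Note that str.lower() is not Turkish-aware (it maps "I" to "i",
        # not "ı"), so real Turkish text needs a proper analyzer.
        counts = Counter(text.lower().split())
        for word, freq in counts.items():
            index[word][article_id] = freq
    return index


# Hypothetical sample data.
articles = {
    1: "kitap okumak güzel kitap",
    2: "güzel bir gün",
}
index = build_inverted_index(articles)

print(index["kitap"])  # {1: 2}
print(index["güzel"])  # {1: 1, 2: 1}
```

Looking up a word is then a single dictionary access, which is the property that makes this layout attractive for "which articles contain this word" queries. Whether it fits in 4 GB of RAM for 150k articles would have to be measured.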

靖瑶 2024-08-22 18:37:28

For 150k articles, you must have a few hundred million rows in the words_articles table. This is manageable, as long as you configure MySQL properly.

A few tips:

Make sure your tables are MyISAM, not InnoDB.
Drop the id field in the words_articles table and make (word_id, article_id) the primary key. Also, create separate indexes for word_id and article_id in the words_articles table:
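```
ALTER TABLE words_articles
DROP PRIMARY KEY,
DROP COLUMN id,  -- the surrogate id is no longer needed
ADD PRIMARY KEY (word_id, article_id),
-- lookups by word_id alone can already use the primary key's leftmost column
ADD INDEX (word_id),
ADD INDEX (article_id);
```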
(doing everything in a single alter statement gives much better performance).
Create an index for word in the words table:
```
ALTER TABLE words ADD INDEX (word);
```
Tweak my.cnf. Specifically, increase the buffer sizes (especially key_buffer_size). my-huge.cnf might be a good starting point.
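The schema tips above can be tried out from Python before touching the real database. This is a rough sketch that uses the standard-library `sqlite3` module as a stand-in for MySQL, so the DDL syntax and the performance characteristics differ. The sample rows, the `type` values, the index names, and the helper functions `articles_for_word` and `words_for_article` are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
    word_id INTEGER PRIMARY KEY,
    type    TEXT,
    word    TEXT
);
CREATE INDEX idx_words_word ON words (word);

-- No surrogate id: (word_id, article_id) is the primary key.
CREATE TABLE words_articles (
    word_id    INTEGER NOT NULL,
    article_id INTEGER NOT NULL,
    frequency  INTEGER NOT NULL,
    PRIMARY KEY (word_id, article_id)
);
CREATE INDEX idx_wa_article ON words_articles (article_id);
""")

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO words (word_id, type, word) VALUES (?, ?, ?)",
    [(1, "noun", "kitap"), (2, "adj", "güzel")],
)
conn.executemany(
    "INSERT INTO words_articles (word_id, article_id, frequency) VALUES (?, ?, ?)",
    [(1, 10, 2), (2, 10, 1), (2, 11, 1)],
)


def articles_for_word(word):
    """Return (article_id, frequency) pairs for one word."""
    return conn.execute(
        """SELECT wa.article_id, wa.frequency
           FROM words AS w
           JOIN words_articles AS wa ON wa.word_id = w.word_id
           WHERE w.word = ?
           ORDER BY wa.article_id""",
        (word,),
    ).fetchall()


def words_for_article(article_id):
    """Return (word, frequency) pairs for one article."""
    return conn.execute(
        """SELECT w.word, wa.frequency
           FROM words_articles AS wa
           JOIN words AS w ON w.word_id = wa.word_id
           WHERE wa.article_id = ?
           ORDER BY w.word_id""",
        (article_id,),
    ).fetchall()


print(articles_for_word("güzel"))  # [(10, 1), (11, 1)]
print(words_for_article(10))       # [('kitap', 2), ('güzel', 1)]
```

The first query relies on the index on `word` plus the composite primary key; the second relies on the separate `article_id` index, which is why both directions of lookup get their own index.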
