n-gram counting in MySQL
I am building a MySQL database that will have roughly 10,000 records. Each record will contain a textual document (a few pages of text in most cases). I want to do all sorts of n-gram counting across the entire database. I have algorithms already written in Python that do what I want against a directory containing a large number of text files, but to use them I would need to extract 10,000 text files from the database, which raises performance issues.
I'm a rookie with MySQL, so I'm not sure whether it has any built-in features for n-gram analysis, or whether there are good plugins out there that would do it. Please note that I need to go up to at least 4-grams (preferably 5-grams) in my analysis, so the simple 2-gram plugins I've seen won't work here. I also need the ability to remove stopwords from the textual documents before doing the n-gram counting.
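For context, the counting step itself is easy to express in Python; below is a minimal sketch of the kind of thing my scripts do (the stopword list and names are purely illustrative, not my actual code):

from collections import Counter
import re

# Illustrative stopword list; a real one (e.g. NLTK's English list) is much larger.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def ngram_counts(text, max_n=5):
    """Count 1- through max_n-grams in one document after dropping stopwords."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in STOPWORDS]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Per-document Counters can be merged with Counter.update() for corpus-wide totals.
print(ngram_counts("The cat sat on the mat and the cat slept.").most_common(5))

The open question is how to run something like this against the documents while they live in the database.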
Any ideas from the community?
Thanks,
Ron
My suggestion would be to use a dedicated full-text search and indexing program like Lucene/Solr, which has much richer and more extensible support for this sort of thing. It will require you to learn a bit to get it set up, but it sounds as if you want to work at a level of customization that will be difficult to achieve inside MySQL.
If you really want to prematurely optimize ;) you could translate your Python into C and then wrap it in a thin MySQL UDF wrapper.
But I'd highly recommend just loading your documents one at a time and running your Python scripts on them to populate a MySQL table of n-grams. My hammer for every nail at the moment is Django. Its ORM makes interacting with MySQL tables, and optimizing those interactions, a cinch. I'm using it to do statistics in Python on multimillion-record databases for production sites that have to return gobs of data in less than a second. And any Python ORM will make it easier to switch out your database if you find something better than MySQL, like Postgres. The best part is that there are lots of Python and Django tools to monitor all aspects of your app's performance (Python execution, MySQL load/save, memory/swap). That way you can attack the right problem. It may be that sequential bulk MySQL reads aren't what's slowing you down...
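For what it's worth, here is a rough sketch of that pipeline without the ORM, using the mysql-connector-python driver; the connection details and the table/column names (documents, body, ngram_counts) are hypothetical placeholders for whatever your schema actually looks like:

import re
from collections import Counter
import mysql.connector  # assumes the mysql-connector-python driver is installed

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # illustrative only

def count_ngrams(text, max_n=5):
    """Count 1- through max_n-grams after dropping stopwords."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in STOPWORDS]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Connection parameters and schema below are placeholders.
conn = mysql.connector.connect(host="localhost", user="ron",
                               password="secret", database="corpus")
cur = conn.cursor()

totals = Counter()
cur.execute("SELECT body FROM documents")
for (body,) in cur:  # one document at a time, no 10,000 files written to disk
    totals.update(count_ngrams(body))

cur.executemany(
    "INSERT INTO ngram_counts (ngram, cnt) VALUES (%s, %s)",
    list(totals.items()))
conn.commit()
conn.close()

If holding corpus-wide totals in memory becomes a problem, you could instead insert per-document counts and let MySQL aggregate them with a GROUP BY / SUM query.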