I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table, storing the document and term identifiers as string values rather than as foreign keys into other tables. I'm rewriting the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well: instead of storing lots of triplets, I could map each document to {term => weight}.
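For concreteness, here is roughly what that row layout looks like through the HBase client API. This is a minimal sketch, not a recommendation: the table name `doc_index` and column family `terms` are hypothetical, and it uses the modern `Connection`/`Table` client API. The row key is the document id, each term becomes a column qualifier, and the weight is the cell value:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical table "doc_index" with one column family "terms":
// row key = document id, qualifier = term, cell value = weight.
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("doc_index"))

val put = new Put(Bytes.toBytes("doc-42"))                    // one row per document
put.addColumn(Bytes.toBytes("terms"), Bytes.toBytes("hbase"), // term as qualifier
              Bytes.toBytes(0.37))                            // weight as value
put.addColumn(Bytes.toBytes("terms"), Bytes.toBytes("mysql"),
              Bytes.toBytes(0.12))
table.put(put)

table.close()
conn.close()
```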
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really: how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so would a direct Java API have an edge over going through JDBC and having MySQL parse each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
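For reference, batching writes client-side is the usual first lever for HBase insert throughput, and typically matters far more than the JDBC-vs-Java-API question. A minimal sketch, assuming the same hypothetical `doc_index` table as above and a modern HBase client:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{BufferedMutator, ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())

// BufferedMutator buffers puts client-side and flushes them in bulk,
// instead of paying one network round trip per row.
val mutator: BufferedMutator = conn.getBufferedMutator(TableName.valueOf("doc_index"))

// Sample (document, term, weight) triplets standing in for the real data.
val triples = Seq(("doc-42", "hbase", 0.37), ("doc-42", "mysql", 0.12))

for ((doc, term, weight) <- triples) {
  val put = new Put(Bytes.toBytes(doc))
  put.addColumn(Bytes.toBytes("terms"), Bytes.toBytes(term), Bytes.toBytes(weight))
  mutator.mutate(put)                  // buffered, not sent immediately
}
mutator.flush()                        // push any remaining buffered puts
mutator.close()
conn.close()
```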
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.
Comments (2)
Use the right tool for the job.
There are a lot of anti-RDBMS or BASE systems (Basically Available, Soft state, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability), to choose from here and here.
I've used traditional RDBMSs, and though you can store CLOBs/BLOBs, they do not have built-in indexes customized specifically for searching those objects.
You want to do most of the work (calculating the weighted frequency for each tuple found) when inserting a document.
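As a minimal illustration of doing that weighting at insert time, here is a plain normalized term-frequency pass in Scala; the tokenization and the weighting scheme are illustrative choices (you could swap in TF-IDF or anything else):

```scala
// Compute a normalized term-frequency weight for every term in a document,
// once, at insert time.
def termWeights(text: String): Map[String, Double] = {
  val terms  = text.toLowerCase.split("""\W+""").filter(_.nonEmpty)
  val counts = terms.groupBy(identity).map { case (t, ts) => t -> ts.length }
  val total  = terms.length.toDouble
  counts.map { case (t, n) => t -> n / total }
}

// termWeights("HBase stores data; HBase scales")
//   => Map(hbase -> 0.4, stores -> 0.2, data -> 0.2, scales -> 0.2)
```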
You might also want to do some work scoring the usefulness of each (documentId, searchWord) pair after each search, so that you can serve better and better results each time. You also want to store a score or weight for each search, and weighted similarity scores against other searches: some searches are likely more common than others, and users often don't phrase their query correctly even though they mean to perform a common search. Inserting a document should also cause some change to the search-weight indexes.
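A toy sketch of that feedback loop, with an exponential moving average as one possible (purely illustrative) update rule for the per-(documentId, searchWord) score:

```scala
import scala.collection.mutable

// Usefulness score per (documentId, searchWord) pair, defaulting to 0.0.
val usefulness = mutable.Map.empty[(String, String), Double].withDefaultValue(0.0)

// Nudge the score after each search, based on whether the user found
// the document useful (a click is the stand-in signal here).
def recordFeedback(docId: String, searchWord: String, clicked: Boolean): Unit = {
  val alpha  = 0.1                         // learning rate of the moving average
  val reward = if (clicked) 1.0 else 0.0
  val key    = (docId, searchWord)
  usefulness(key) = (1 - alpha) * usefulness(key) + alpha * reward
}
```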
The more I think about it, the more complex the solution becomes. You have to start with a good design first: the more factors your design anticipates, the better the outcome.
MapReduce seems like a great way of generating the tuples. If you can get a Scala job into a jar file (not sure, since I've not used Scala before and am a JVM n00b), it'd be a simple matter to send it along and write a bit of a wrapper to run it on the MapReduce cluster.
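As a rough sketch of what such a job could look like, here is a classic Hadoop Mapper/Reducer pair written in Scala; the tab-separated `docId<TAB>text` input format and the raw-count weights are assumptions made purely for illustration:

```scala
import org.apache.hadoop.io.{DoubleWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Mapper: one input line per document ("docId<TAB>text"); emits
// ("docId:term", 1.0) for each term occurrence.
class TupleMapper extends Mapper[LongWritable, Text, Text, DoubleWritable] {
  private val one = new DoubleWritable(1.0)
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, DoubleWritable]#Context): Unit = {
    val Array(docId, body) = value.toString.split("\t", 2)
    for (term <- body.toLowerCase.split("""\W+""") if term.nonEmpty)
      ctx.write(new Text(s"$docId:$term"), one)
  }
}

// Reducer: sums the occurrence counts into one weight per (docId, term) pair.
class TupleReducer extends Reducer[Text, DoubleWritable, Text, DoubleWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[DoubleWritable],
                      ctx: Reducer[Text, DoubleWritable, Text, DoubleWritable]#Context): Unit = {
    var sum = 0.0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new DoubleWritable(sum))
  }
}
```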
As for storing the tuples after you're done, you might also want to consider a document-based database like MongoDB if you're just storing tuples.
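Storing the finished triplets in MongoDB is then one BSON document per tuple. A minimal sketch using the current MongoDB Java driver from Scala, with hypothetical `index`/`weights` database and collection names:

```scala
import com.mongodb.client.MongoClients
import org.bson.Document

// Hypothetical database "index" and collection "weights";
// one document per (document, term, weight) triplet.
val client = MongoClients.create("mongodb://localhost:27017")
val coll   = client.getDatabase("index").getCollection("weights")

coll.insertOne(
  new Document("doc", "doc-42").append("term", "hbase").append("weight", 0.37)
)
client.close()
```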
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using Lucene or Solr to do what you're doing, instead of writing your own?
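For reference, letting Lucene do the term weighting means you only hand it the raw text and it builds the weighted inverted index itself. A minimal indexing sketch against a recent Lucene (5+) API, with a hypothetical index path:

```scala
import java.nio.file.Paths
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, StringField, TextField}
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.FSDirectory

// Lucene tokenizes the text and maintains its own term statistics,
// so no (document, term, weight) bookkeeping is needed on your side.
val writer = new IndexWriter(
  FSDirectory.open(Paths.get("/tmp/doc-index")),   // hypothetical index location
  new IndexWriterConfig(new StandardAnalyzer())
)

val doc = new Document()
doc.add(new StringField("id", "doc-42", Field.Store.YES))  // exact-match key
doc.add(new TextField("body", "full text of the document", Field.Store.NO))
writer.addDocument(doc)
writer.close()
```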