Is NoSQL the best fit for this particular database problem?
I have a problem that I think a NoSQL solution answers, but I am not sure. I am also not sure which type of NoSQL database (object, document, graph, key-value, etc.) would be best suited to solving it.
Problem:
I have two collections. CollectionA contains 2K+ strings (domain names). CollectionB is much larger and looks (pseudo) like this:
{
"To" : "[email protected],[email protected],there_could_be_100@more_address.com",
"Bcc" : "[email protected],[email protected],there_could_be_100@more_address.com",
"From" : "[email protected],[email protected],there_could_be_100@more_address.com",
"Subject" : "Email Subject",
"Unknown" : "NumberOfFields",
"N" : "PlusOneExtraFields",
}
Knowns:
- There can be hundreds of people listed in the To, Bcc, and From strings.
- I don't have a good way to explode the To, From, and Bcc fields.
- Without a way to explode those fields, I am forced to search the strings.
- There are a few known fields but many unknown fields.
- The requirements don't currently call for searching across the unknown fields.
- The database engine needs to run on a Windows desktop.
Current line of thinking:
Use a NoSQL solution, and maybe the C# dynamic keyword?
Fuzzy areas:
Is this a problem that is easily solved by a document database?
Is searching/comparing across this type of data structure a good fit for Map/Reduce?
4 Answers
I totally agree with @HighTechRider: denormalization of the data (exploding it, as you put it) seems a must in this instance for performant queries if the volume of data is as large as you imply. Otherwise it doesn't matter what product you pick; it'll end up being a free-text scan of some fashion or other.
@chx's suggestion of Sphinx seems plausible, at least for accelerating the latter. But there are hidden costs to that route: it requires you to bundle, install, manage, patch, update, etc. someone else's service alongside your software.
Minimizing desktop resource consumption in indexing and querying has to be a high priority, and setting up a free-text server on a desktop seems somewhat contrary to that charter.
I'd start with the basic file system, using filesystem objects to represent your denormalized data. Or, if representing and executing your queries that way seems too complex, look at simple embedded table libraries like SQLite or SQL Compact Edition before trying to shoehorn more exotic server-targeted products onto the desktop.
Nice comparison of SQLite vs. SQL Compact Edition here:
http://www.tech-archive.net/Archive/DotNet/microsoft.public.dotnet.framework.compactframework/2005-12/msg00019.html
SQLite can also create free-text indexes that cover some of your "unknown field" scenarios in future.
As for map-reduce, it's strategy is valid for the domain you're approaching.
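A hedged sketch of that SQLite route, using Python's bundled sqlite3 module (the schema and data are hypothetical): explode each message's address fields into one row per (message, role, domain), so matching the 2K+ watch-list domains becomes an indexed join instead of a string scan.

```python
import sqlite3

# Denormalized schema: one row per (message, role, domain), plus an index
# on domain so the watch-list join never scans raw address strings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (id INTEGER PRIMARY KEY, subject TEXT);
    CREATE TABLE recipient (
        message_id INTEGER REFERENCES message(id),
        role TEXT,          -- 'To', 'From', or 'Bcc'
        domain TEXT
    );
    CREATE INDEX idx_recipient_domain ON recipient(domain);
    CREATE TABLE watchlist (domain TEXT PRIMARY KEY);
""")

conn.execute("INSERT INTO message VALUES (1, 'Email Subject')")
conn.executemany(
    "INSERT INTO recipient VALUES (?, ?, ?)",
    [(1, "To", "example.com"), (1, "To", "other.net"), (1, "From", "sender.org")],
)
conn.executemany(
    "INSERT INTO watchlist VALUES (?)",
    [("example.com",), ("watched.org",)],
)

# All messages that touch a watched domain, resolved via the domain index:
rows = conn.execute("""
    SELECT DISTINCT m.id, m.subject
    FROM message m
    JOIN recipient r ON r.message_id = m.id
    JOIN watchlist w ON w.domain = r.domain
""").fetchall()
```

For the "unknown field" scenarios the answer mentions, SQLite's FTS extension could hold the remaining fields as free text alongside this structured core.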
Store the messages in XML and search with Sphinx. Use xmlpipe2 to feed Sphinx, through something like grep, so that only the known fields go into it. Once you need to search on more, add those fields to your filter and the schema and reindex. Sphinx can index at such speeds that this poses no real problem. It can be distributed, too.
You are calling for text search, which means Solr or Sphinx, and between the two, Sphinx is way easier to set up on a Windows desktop.
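For reference, an xmlpipe2 feed along the lines this answer describes would look roughly like the fragment below. The field names are illustrative, and the document content is made up; check the Sphinx documentation for the exact schema elements.

```xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <!-- Only the known fields are declared; add more later and reindex. -->
    <sphinx:field name="to_addrs"/>
    <sphinx:field name="from_addrs"/>
    <sphinx:field name="bcc_addrs"/>
    <sphinx:field name="subject"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <to_addrs>[email protected],[email protected]</to_addrs>
    <from_addrs>[email protected]</from_addrs>
    <bcc_addrs></bcc_addrs>
    <subject>Email Subject</subject>
  </sphinx:document>
</sphinx:docset>
```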
I feel this is a good candidate for Apache Lucene.NET.
You can create a Lucene document for the structure specified above, with one field per key.
But the problem with Lucene is that you cannot add new fields or modify the existing field structure at a later time, so you have to delete the documents and create new ones from scratch.
A better approach would be to make all your fields indexable, to cover the unknown fields.
No, it is not. It is a candidate for a full-text search engine, which has nothing to do with "NoSQL", whatever that is.
Full-text search engines often use SQL or some variant of it; for example, Sphinx or Lucene. You could also use Microsoft's (but I don't know whether that would satisfy your requirements; you'd need to check).