Is NoSQL the best option for this particular database problem?

Posted 2024-10-09 21:25:56

I have a problem and I think a NoSQL solution is the answer, but I am not sure. I am also not sure which type of NoSQL DB (object, document, graph, key-value, etc.) would be best suited to solve it.

Problem:

I have two collections. CollectionA contains 2K+ strings (domain names). CollectionB is much larger and looks (pseudo) like this:

{
    "To" : "[email protected],[email protected],there_could_be_100@more_address.com",
    "Bcc" : "[email protected],[email protected],there_could_be_100@more_address.com",
    "From" : "[email protected],[email protected],there_could_be_100@more_address.com",
    "Subject" : "Email Subject",
    "Unknown" : "NumberOfFields",
    "N" : "PlusOneExtraFields"
}

Knowns:

There can be 100s of people listed in the To, Bcc, and From strings.
I don't have a good way to explode the To, From, Bcc fields.
Without a way to explode the To, From, Bcc fields I am forced to search strings.
There are a few known fields but many unknown fields.
The requirements don't currently call for searching across the unknown fields.
The database engine needs to run on a Windows desktop.

Current line of thinking:

Using a NoSQL solution and maybe the C# dynamic keyword?

Fuzzy

Is this a problem that is easily solved by a document database?
Is searching/comparing across this type of data structure something that is suited to Map/Reduce?

尴尬癌患者 2024-10-16 21:25:56

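I fully agree with @HighTechRider: if the data volume is as large as you imply, then denormalizing the data (exploding it, as you put it) seems a must for performant queries. Otherwise it doesn't matter which product you choose; it will end up being a free-text scan in one fashion or another.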

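@chx's suggestion of Sphinx seems reasonable, at least for accelerating the latter. But that route has hidden costs: you have to bundle, install, manage, patch, update, etc. someone else's service alongside your software.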

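Minimizing desktop resource consumption in indexing and querying has to be a high priority, and setting up a free-text server on a desktop seems somewhat contrary to that charter.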

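I would start with the basic file system, using file system objects to represent your denormalized data. Or, if representing and executing your queries that way seems too complex, look at simple embedded table libraries such as SQLite or SQL Compact Edition before trying to shoehorn more exotic, server-targeted products onto the desktop.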
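The denormalization ("exploding") step itself can be quite small. Below is a minimal sketch in Python, assuming the address fields hold bare, comma-separated addresses (no display names such as `Name <addr>`); the function names and the sample addresses are made up for illustration and are not from the question.

```python
def explode_addresses(raw):
    """Split a comma-separated address string into trimmed, lowercase addresses."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def domains_of(raw):
    """Return the set of domains found in a comma-separated address string."""
    return {addr.rsplit("@", 1)[1] for addr in explode_addresses(raw) if "@" in addr}


def matching_domains(message, known_domains, fields=("To", "From", "Bcc")):
    """Return the known domains that appear in any of the given address fields."""
    found = set()
    for field in fields:
        found |= domains_of(message.get(field, ""))
    return found & known_domains


# Hypothetical stand-ins for CollectionA and one CollectionB record.
known = {"example.com", "example.org"}
message = {
    "To": "alice@example.com, bob@elsewhere.net",
    "Bcc": "carol@EXAMPLE.org",
    "From": "dave@elsewhere.net",
    "Subject": "Email Subject",
}

hits = matching_domains(message, known)
print(sorted(hits))
```

With the domains held in a set, each record costs one pass over its addresses instead of one substring scan per domain.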

A comparison of SQLite and SQL Compact Edition:

http://www.tech-archive.net/Archive/DotNet/microsoft.public.dotnet.framework.compactframework/2005-12/msg00019.html

SQLite can also create free-text indexes, which would cover some of your future "unknown field" scenarios.
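To make the embedded-table route concrete, here is a rough sketch using Python's built-in `sqlite3` module. The table and column names (`domains`, `messages`, `message_domains`) and the sample records are assumptions made for illustration; only the To/From/Bcc/Subject field names come from the question.

```python
import sqlite3

# In-memory database for the sketch; a desktop install would pass a file path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domains (name TEXT PRIMARY KEY);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT);
    CREATE TABLE message_domains (
        message_id INTEGER NOT NULL REFERENCES messages(id),
        field      TEXT NOT NULL,
        domain     TEXT NOT NULL
    );
    CREATE INDEX idx_message_domains_domain ON message_domains(domain);
""")


def add_message(conn, message_id, record):
    """Store one record, writing one message_domains row per exploded address."""
    conn.execute(
        "INSERT INTO messages (id, subject) VALUES (?, ?)",
        (message_id, record.get("Subject", "")),
    )
    for field in ("To", "From", "Bcc"):
        for address in record.get(field, "").split(","):
            address = address.strip().lower()
            if "@" in address:
                conn.execute(
                    "INSERT INTO message_domains VALUES (?, ?, ?)",
                    (message_id, field, address.rsplit("@", 1)[1]),
                )


# Hypothetical sample data standing in for CollectionA and CollectionB.
conn.executemany("INSERT INTO domains VALUES (?)", [("example.com",), ("example.org",)])
add_message(conn, 1, {
    "To": "alice@example.com,bob@elsewhere.net",
    "Bcc": "carol@example.org",
    "From": "dave@elsewhere.net",
    "Subject": "Email Subject",
})
add_message(conn, 2, {"To": "erin@elsewhere.net", "From": "frank@nowhere.test"})

# Which messages mention a known domain, and in which field?
rows = conn.execute("""
    SELECT DISTINCT md.message_id, md.field, md.domain
    FROM message_domains AS md
    JOIN domains AS d ON d.name = md.domain
    ORDER BY md.message_id, md.field, md.domain
""").fetchall()
print(rows)
```

The index on `message_domains(domain)` is what turns the comparison into an indexed join rather than a string scan; the unknown fields could stay in a separate blob or text column until they need to be searched.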

As for map-reduce, its strategy is valid for the domain you are approaching.
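For what it is worth, the map-reduce shape of this problem can be shown on a single machine. The sketch below is only an illustration of the strategy, not a distributed implementation: the map step turns each record into per-domain counts, and the reduce step merges them. The sample records and the `KNOWN_DOMAINS` set are invented for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for CollectionA.
KNOWN_DOMAINS = {"example.com", "example.org"}


def map_message(record):
    """Map step: count occurrences of known domains in one record's address fields."""
    counts = Counter()
    for field in ("To", "From", "Bcc"):
        for address in record.get(field, "").split(","):
            _, at, domain = address.strip().lower().rpartition("@")
            if at and domain in KNOWN_DOMAINS:
                counts[domain] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results by summing their counts."""
    return left + right


# Hypothetical stand-ins for CollectionB records.
records = [
    {
        "To": "alice@example.com,bob@elsewhere.net",
        "Bcc": "carol@example.org",
        "From": "dave@example.com",
    },
    {"To": "erin@example.org", "From": "frank@elsewhere.net"},
]

totals = reduce(reduce_counts, map(map_message, records), Counter())
print(dict(totals))
```

Because each record is mapped independently and the merge is associative, the same two functions could be handed to a real map-reduce engine if the data ever outgrows one desktop.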

⒈起吃苦の倖褔 2024-10-16 21:25:56

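Store it as XML and search it with Sphinx. Use xmlpipe2 to feed Sphinx through something like grep, so that only the known fields are fed into it. Once you need to search on more, add those fields to your filter and to the schema, and reindex. Sphinx indexes so fast that this poses no real problem. It can be distributed, too.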
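As a rough idea of what such a filter could emit, here is a sketch in Python that writes only the known fields as an xmlpipe2 stream. The element layout (`sphinx:docset`, `sphinx:schema`, `sphinx:field`, `sphinx:document`) follows my understanding of Sphinx's xmlpipe2 format and should be checked against the Sphinx documentation; the Sphinx-side field names (`to_addr` and so on) and the sample record are assumptions.

```python
from xml.sax.saxutils import escape, quoteattr

# Record key -> Sphinx field name. The Sphinx-side names are made up for this sketch.
KNOWN_FIELDS = {
    "To": "to_addr",
    "From": "from_addr",
    "Bcc": "bcc_addr",
    "Subject": "subject",
}


def to_xmlpipe2(records):
    """Render records as an xmlpipe2 document set, dropping all unknown fields."""
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for sphinx_name in KNOWN_FIELDS.values():
        lines.append(f"<sphinx:field name={quoteattr(sphinx_name)}/>")
    lines.append("</sphinx:schema>")
    # Document ids start at 1 because Sphinx expects unique, non-zero ids.
    for doc_id, record in enumerate(records, start=1):
        lines.append(f'<sphinx:document id="{doc_id}">')
        for key, sphinx_name in KNOWN_FIELDS.items():
            value = escape(record.get(key, ""))
            lines.append(f"<{sphinx_name}>{value}</{sphinx_name}>")
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


# Hypothetical sample record; "Unknown" is deliberately left out of the output.
sample = [{
    "To": "alice@example.com,bob@elsewhere.net",
    "From": "dave@elsewhere.net",
    "Subject": "Q&A follow-up",
    "Unknown": "NumberOfFields",
}]

xml_stream = to_xmlpipe2(sample)
print(xml_stream)
```

Adding a field later means adding one entry to `KNOWN_FIELDS` and reindexing, which matches the workflow described above.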

You are asking for text search, which means Solr or Sphinx, and of the two, Sphinx is much easier to set up on a Windows desktop.

星軌x 2024-10-16 21:25:56

I feel this is a good candidate for Apache Lucene.NET.

You can create a Lucene document for the structure above like this:

    using Lucene.Net.Documents;

    // ToData, FromData and BCCData are strings holding the raw field values.
    Document doc = new Document();

    // Known address fields: stored, analyzed for search, with term vectors.
    doc.Add(new Field(
        "To",
        ToData,
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));

    doc.Add(new Field(
        "From",
        FromData,
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));

    doc.Add(new Field(
        "BCC",
        BCCData,
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));

    // Since you don't want the Unknown field to be indexed, you can make it Index.NO.
    // UnknownData (an assumed name) holds the raw value of the unknown field.
    doc.Add(new Field(
        "Unknown",
        UnknownData,
        Field.Store.YES,
        Field.Index.NO));

But the problem with Lucene is that it cannot update an indexed document in place: to add a field to an existing document, or to change how an existing field is indexed, you have to delete the document and create it again from scratch.

A better approach would be to make the unknown fields indexable as well from the start.
