
Dynamically storing and retrieving 3,000,000 words in C# .NET using collections

Posted 2024-09-26 15:22:49 · 1215 characters · 2 views · 0 comments

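How can I dynamically store and retrieve more than 3,000,000 words without using SQL?

I take a word from a document and check whether that word already exists.

If it does, I increment its count for the corresponding document.

If it does not, i.e. it is a new word, I create a new column, increment the count for that document, and set it to zero for all other documents.

For example:

I have 93,000 documents, each containing roughly 5,000 words. Whenever a new word appears, a new column is added. In this way, 960,000 words occur.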

                Word1  word2  word3  word4  word5  ...  new word  ...  word96000
Document1           2     19     45     16      7  ...         0  ...
Document2           4      6      3     56      3  ...         0  ...
Document3          56     34      1     67      4  ...         0  ...
Document4           7     45      9     45      6  ...         0  ...
Document5          56     43    234     87     46  ...         0  ...
Document6          56      6      2      5     23  ...         0  ...
...
Document1000        5      9      9     89     34  ...         1  ...

The count for each added word is updated dynamically in the entry for the corresponding document.


甜柠檬 2024-10-03 15:22:49


Such a sparse matrix is often best implemented as a dictionary of dictionaries.

Dictionary<string, Dictionary<string, int>> index;

But the question lacks too many details to give more advice.
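
As a rough illustration of the dictionary-of-dictionaries idea, here is a minimal sketch. The class name SparseWordIndex, its method names, and the choice of the document name as the outer key are assumptions made for this example; the answer does not specify them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the answer): a sparse document-by-word count matrix.
public class SparseWordIndex
{
    // Outer key: document name (assumed). Inner key: word. Value: occurrences in that document.
    private readonly Dictionary<string, Dictionary<string, int>> index =
        new Dictionary<string, Dictionary<string, int>>();

    // Increments the count of `word` in `document`, creating entries on first use.
    public void Add(string document, string word)
    {
        if (!index.TryGetValue(document, out var counts))
        {
            counts = new Dictionary<string, int>();
            index[document] = counts;
        }
        counts.TryGetValue(word, out int current); // current is 0 when the word is new
        counts[word] = current + 1;
    }

    // A missing entry is an implicit zero, so no row or column of zeros is stored.
    public int Count(string document, string word)
    {
        return index.TryGetValue(document, out var counts)
            && counts.TryGetValue(word, out int n) ? n : 0;
    }
}
```

With this layout a new word costs one dictionary entry in the documents where it occurs, rather than a zero-filled column across all documents.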


半衾梦 2024-10-03 15:22:49


To avoid wasting memory, I would suggest the following:

class Document {
   // Indices of the words that occur in this document.
   public List<int> words = new List<int>();
}
List<Document> documents;

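If you have 1000 documents, create List<Document> documents = new List<Document>(1000); and add one Document per file (the capacity argument only reserves space, it does not create the elements).
Now if document1 has the words word2, word19 and word45, add the indices of these words to your document: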

documents[0].words.Add(2);
documents[0].words.Add(19);
documents[0].words.Add(45);

You can modify the code to store the words themselves.
To see in how many documents the word word2 occurs, you can go through the entire list of documents and check whether each document contains the word index.

int count = 0;
foreach (Document d in documents) {
   if (d.words.Contains(2)) {
      count++;
   }
}
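
The answer does not say how a word gets its index. One possible way, sketched below, is a vocabulary dictionary that hands out the next unused integer the first time a word is seen. The names Vocabulary, IndexedDocument and DocumentFrequency are hypothetical, and HashSet<int> is swapped in for the answer's List<int> so that the membership test does not scan the whole list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical vocabulary: assigns each distinct word a stable integer index.
public class Vocabulary
{
    private readonly Dictionary<string, int> ids = new Dictionary<string, int>();

    public int GetOrAdd(string word)
    {
        if (!ids.TryGetValue(word, out int id))
        {
            id = ids.Count;   // next unused index
            ids[word] = id;
        }
        return id;
    }
}

// Variant of the answer's Document class, using a set instead of a list (an assumption).
public class IndexedDocument
{
    public HashSet<int> Words { get; } = new HashSet<int>();
}

public static class DocumentFrequency
{
    // Counts the documents that contain the given word index.
    public static int Count(IEnumerable<IndexedDocument> documents, int wordId)
    {
        int count = 0;
        foreach (var d in documents)
        {
            if (d.Words.Contains(wordId)) count++;
        }
        return count;
    }
}
```

Note that a set records only whether a word occurs in a document; if per-document counts are needed, as in the question's table, a Dictionary<int, int> per document would be the closer fit.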


伪心 2024-10-03 15:22:49


var nWords = (from Match m in Regex.Matches(File.ReadAllText("big.txt").ToLower(), "[a-z]+")
              group m.Value by m.Value)
             .ToDictionary(gr => gr.Key, gr => gr.Count());

This gives you a dictionary keyed by word, with the count as the value. I'm sure you could then save that information as each file is read in and then build up your final lists.
Maybe?
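
To show how the per-file counts might be collected into one structure, here is a small sketch that applies the same regular expression to each document's text. The names WordCounter, CountWords and BuildIndex are hypothetical, the input is taken as in-memory strings rather than files to keep the example self-contained, and ToLowerInvariant is used in place of the answer's ToLower.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Counts the a-z words in one document's text, using the answer's "[a-z]+" pattern.
    public static Dictionary<string, int> CountWords(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), "[a-z]+")
                    .Cast<Match>()
                    .GroupBy(m => m.Value)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    // Builds one count dictionary per document.
    // The keys of `texts` are assumed to be document names, the values their full text.
    public static Dictionary<string, Dictionary<string, int>> BuildIndex(
        IDictionary<string, string> texts)
    {
        return texts.ToDictionary(kv => kv.Key, kv => CountWords(kv.Value));
    }
}
```

Each document only stores the words it actually contains, so a word that is absent from a document simply has no key there.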
