LINQ keyword extraction - limiting the number of words considered

Published 09-30 15:50 · 237 words · 4 views · 0 comments

With regards to this solution.

Is there a way to limit the number of keywords taken into consideration? For example, I'd like only the first 1000 words of the text to be counted. There's a "Take" method in LINQ, but it serves a different purpose - all words will be counted, and N records will be returned. What's the right alternative to do this correctly?

Comments (3)

琉璃繁缕 2024-10-07 15:50:22

Simply apply Take earlier - straight after the call to Split:

var results = src.Split()
                 .Take(1000)
                 .GroupBy(...) // etc
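
Filled out, the whole pipeline from this answer might look like the sketch below (the sample `src` string and console output are illustrative additions, not part of the original answer):

```csharp
using System;
using System.Linq;

class WordCount
{
    static void Main()
    {
        var src = "the quick brown fox jumps over the lazy dog the fox";

        // Cap the input at 1000 words immediately after Split, *before*
        // grouping, so the rest of the query never sees extra words.
        var results = src.Split()
                         .Take(1000)
                         .GroupBy(str => str)   // group words by value
                         .Select(g => new
                         {
                             str = g.Key,       // the word
                             count = g.Count()  // how many times it occurred
                         });

        foreach (var r in results)
            Console.WriteLine($"{r.str}: {r.count}");
    }
}
```

Note that `string.Split()` still scans the entire string eagerly; `Take(1000)` only limits what the downstream operators see.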

三寸金莲 2024-10-07 15:50:22

Well, strictly speaking LINQ is not necessarily going to read everything; Take will stop as soon as it can. The problem is that in the related question you look at Count, and it is hard to get a Count without consuming all the data. Likewise, string.Split will look at everything.

But if you wrote a lazy non-buffering Split function (using yield return) and you wanted the first 1000 unique words, then

var words = LazySplit(text).Distinct().Take(1000);

would work
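
A minimal sketch of such a lazy, non-buffering split using `yield return` (the name `LazySplit` matches the snippet above; this particular implementation, which splits on whitespace only, is an assumption):

```csharp
using System.Collections.Generic;

static class LazySplitter
{
    // Yields words one at a time; scanning stops as soon as the
    // consumer (e.g. Distinct().Take(1000)) stops asking for more.
    public static IEnumerable<string> LazySplit(string text)
    {
        int start = -1;  // index where the current word began, or -1
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                if (start >= 0)
                {
                    yield return text.Substring(start, i - start);
                    start = -1;
                }
            }
            else if (start < 0)
            {
                start = i;  // a new word begins here
            }
        }
        if (start >= 0)
            yield return text.Substring(start);  // trailing word
    }
}
```

Because each word is produced on demand, `LazySplit(text).Distinct().Take(1000)` stops pulling - and therefore stops scanning the text - once 1000 unique words have been seen.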

屋顶上的小猫咪 2024-10-07 15:50:22

Enumerable.Take does in fact stream results out; it doesn't buffer up its source entirely and then return only the first N. Looking at your original solution though, the problem is that the input to where you would want to do a Take is String.Split. Unfortunately, this method doesn't use any sort of deferred execution; it eagerly creates an array of all the 'splits' and then returns it.
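
The streaming behaviour of `Enumerable.Take` can be seen with a deliberately infinite iterator (a small demonstration added here, not part of the original answer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TakeStreams
{
    // An infinite lazy sequence: 0, 1, 2, ...
    static IEnumerable<int> Naturals()
    {
        for (int i = 0; ; i++)
            yield return i;
    }

    static void Main()
    {
        // If Take buffered its source first, this loop would never start.
        // Because Take streams, it pulls exactly three items and stops.
        foreach (var n in Naturals().Take(3))
            Console.WriteLine(n);  // prints 0, 1, 2
    }
}
```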

Consequently, the technique to get a streaming sequence of words from some text would be something like:

var words = src.StreamingSplit()  // you'll have to implement that            
               .Take(1000);

However, I do note that the rest of your query is:

...
.GroupBy(str => str)   // group words by the value
.Select(g => new
             {
                str = g.Key,      // the value
                count = g.Count() // the count of that value
              });

Do note that GroupBy is a buffering operation - you can expect that all of the 1,000 words from its source will end up getting stored somewhere in the process of the groups being piped out.

As I see it, the options are:

  1. If you don't mind going through all of the text for splitting purposes, then src.Split().Take(1000) is fine. The downside is wasted time (to continue splitting after it is no longer necessary) and wasted space (to store all of the words in an array even though only the first 1,000 will be needed). However, the rest of the query will not operate on any more words than necessary.
  2. If you can't afford to do (1) because of time / memory constraints, go with src.StreamingSplit().Take(1000) or equivalent. In this case, none of the original text will be processed after 1,000 words have been found.

Do note that those 1,000 words themselves will end up getting buffered by the GroupBy clause in both cases.
