Computing the mode (most frequent element) of a set in linear time?

In the book "The Algorithm Design Manual" by Skiena, computing the mode (most frequent element) of a set, is said to have a Ω(n log n) lower bound (this puzzles me), but also (correctly i guess) that no faster worst-case algorithm exists for computing the mode. I'm only puzzled by the lower bound being Ω(n log n).

See the page of the book on Google Books

But surely the mode could in some cases be computed in linear time (best case), e.g. by Java code like the snippet below (which finds the most frequent character in a string), the "trick" being to count occurrences using a hash table. This seems obvious.

So, what am I missing in my understanding of the problem?

EDIT: (Mystery solved) As StriplingWarrior points out, the lower bound holds if only comparisons are used, i.e. no indexing of memory; see also: http://en.wikipedia.org/wiki/Element_distinctness_problem

import java.util.HashMap;

// Linear time
char computeMode(String input) {
  // initialize currentMode to the first char
  char[] chars = input.toCharArray();
  char currentMode = chars[0];
  int currentModeCount = 0;
  HashMap<Character, Integer> counts = new HashMap<Character, Integer>();
  for (char character : chars) {
    int count = putget(counts, character); // occurrences so far
    // test whether character should be the new currentMode
    if (count > currentModeCount) {
      currentMode = character;
      currentModeCount = count; // also save the count
    }
  }
  return currentMode;
}

// Constant time
int putget(HashMap<Character, Integer> map, char character) {
  if (!map.containsKey(character)) {
    // if character not seen before, initialize its count to zero
    map.put(character, 0);
  }
  // increment and store the new count
  int newValue = map.get(character) + 1;
  map.put(character, newValue);
  return newValue;
}
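For reference, the same counting idea can be written more compactly with Java 8's Map.merge, which folds putget into a single call. This is just a sketch of an alternative formulation, not part of the original question; the method name computeModeCompact is illustrative.

import java.util.HashMap;
import java.util.Map;

// Sketch: the same hash-counting approach, using Map.merge (Java 8+)
char computeModeCompact(String input) {
  Map<Character, Integer> counts = new HashMap<>();
  char currentMode = input.charAt(0); // assumes non-empty input, like the original
  int currentModeCount = 0;
  for (char c : input.toCharArray()) {
    // merge returns the updated count: 1 if c was absent, old + 1 otherwise
    int count = counts.merge(c, 1, Integer::sum);
    if (count > currentModeCount) {
      currentMode = c;
      currentModeCount = count;
    }
  }
  return currentMode;
}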

Comments (3)

清浅ˋ旧时光 2024-10-09 08:09:52

The author seems to be basing his logic on the assumption that comparison is the only operation available to you. Using a hash-based data structure sidesteps this: it reduces the likelihood of needing comparisons to the point where, in most cases, each lookup takes effectively constant time.

However, if the inputs were hand-picked to always produce hash collisions, you would effectively turn your hash table into a list, making the algorithm O(n²). As the author points out, simply sorting the values first provides the best guaranteed algorithm, even though in most cases a hash table would be preferable.
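To make the sort-based fallback concrete, here is a sketch (my illustration, not code from the answer): after sorting, equal values sit next to each other, so one linear scan finds the longest run. Sorting dominates the cost, giving a guaranteed O(n log n) algorithm with no assumptions about hashing.

import java.util.Arrays;

// Sketch: guaranteed O(n log n) mode computation via sorting
int computeModeBySorting(int[] values) {
  int[] sorted = values.clone(); // assumes non-empty input
  Arrays.sort(sorted); // O(n log n)
  int mode = sorted[0];
  int bestRun = 1;
  int run = 1;
  for (int i = 1; i < sorted.length; i++) {
    // equal values are adjacent after sorting, so count run lengths
    run = (sorted[i] == sorted[i - 1]) ? run + 1 : 1;
    if (run > bestRun) {
      bestRun = run;
      mode = sorted[i];
    }
  }
  return mode;
}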

少年亿悲伤 2024-10-09 08:09:52

So, what am I missing in my understanding of the problem?

In many particular cases, an array or hash table suffices. In "the general case" it does not, because hash table access is not always constant time.

In order to guarantee constant time access, you must be able to guarantee that the number of keys that can possibly end up in each bin is bounded by some constant. For characters this is fairly easy, but if the set elements were, say, doubles or strings, it would not be (except in the purely academic sense that there are, e.g., a finite number of double values).
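For the bounded case the answer mentions, here is a sketch (my addition): char has at most 2^16 distinct values, so a plain array of counters indexed by the character itself gives genuinely constant-time updates, with no hashing involved.

// Sketch: counting with a fixed-size array, truly O(1) per update
char computeModeArray(String input) {
  int[] counts = new int[Character.MAX_VALUE + 1]; // one slot per possible char
  char currentMode = input.charAt(0); // assumes non-empty input
  int currentModeCount = 0;
  for (char c : input.toCharArray()) {
    counts[c]++;
    if (counts[c] > currentModeCount) {
      currentMode = c;
      currentModeCount = counts[c];
    }
  }
  return currentMode;
}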

夕嗳→ 2024-10-09 08:09:52

Hash table lookups take amortized constant time, i.e., in general, the total cost of looking up n random keys is O(n). In the worst case, though, a single lookup can take linear time. Therefore, while in general hashing can reduce the order of mode computation to O(n), in the worst case it increases it to O(n²).
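To see how adversarial inputs defeat the hash, here is a sketch (my illustration, not part of the answer) based on a well-known Java collision: "Aa" and "BB" have the same hashCode (2112), so every string built by concatenating those two blocks collides with every other, forcing all keys into a single bucket. (Note that since Java 8, HashMap converts large collision bins into balanced trees, so for such keys the degradation is toward O(n log n) per scan rather than O(n²), but the underlying point stands.)

import java.util.ArrayList;
import java.util.List;

// Sketch: generate 2^k distinct strings that all share one hashCode
List<String> collidingKeys(int k) {
  List<String> keys = new ArrayList<String>();
  keys.add("");
  for (int i = 0; i < k; i++) {
    List<String> next = new ArrayList<String>();
    for (String prefix : keys) {
      // "Aa" and "BB" hash identically, so all concatenations collide
      next.add(prefix + "Aa");
      next.add(prefix + "BB");
    }
    keys = next;
  }
  return keys; // 2^k keys, every one with the same hashCode
}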
