How to calculate the median of a Map?

Posted on 2024-09-05 20:56:04

For a map where each key is a number in a sequence and each value is the count of how often that number appears in the sequence, what would a Java implementation of an algorithm to calculate the median look like?

For example:

1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7

in a map:

Map<Integer, Integer> map = ...
map.put(1, 2);
map.put(2, 4);
map.put(3, 3);
map.put(4, 1);
map.put(5, 1);
map.put(6, 3);
map.put(7, 2);

double median = calculateMedian(map);
print(median);

would result in:

> print(median);
3
>

So what I am looking for is a Java implementation of calculateMedian.

Comments (4)

许仙没带伞 2024-09-12 20:56:09
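  • Use a SortedMap, i.e. a TreeMap, so that the keys are iterated in ascending order
  • Iterate through the map once to calculate the total number of elements, i.e. the sum of all occurrences
  • Iterate again and add up occurrences until you've reached half of the total. The number that caused the sum to exceed half of the total is the median (if the total is even and the running sum hits exactly half at the end of a bin, average that bin's key with the next key)
  • Test extensively for off-by-one errors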
浅紫色的梦幻 2024-09-12 20:56:09

For an easy but maybe not-so-efficient algorithm, I'd do it like this:

1. Expand the map to a list.

In practice: iterate through the map and add each key to a new list as many times as its value says. Finally, sort the list.

//...
List<Integer> field = new ArrayList<Integer>();
for (Integer key : map.keySet()) {  // a Map is not Iterable itself, so iterate its key set
  for (int i = 0; i < map.get(key); i++) {
    field.add(key);
  }
}
Collections.sort(field);

2. Calculate the median.

Now you have to implement a method double calculateMedian(List<Integer> sorted). This depends on the kind of median you need. If it's just the sample median, then the result is either the middlemost value (for lists with an odd number of elements) or the average of the two middlemost values (for lists with an even length), which is why the return type should be double rather than int. Note that the list needs to be sorted!

(Ref: Sample Median / wikipedia)
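
A minimal sketch of that method for the sample median could look like this. The class name SortedListMedian and the choice to throw on an empty list are assumptions for illustration, and the method relies on the caller passing an already sorted list.

```java
import java.util.Arrays;
import java.util.List;

class SortedListMedian {
    // Sample median of an already sorted, non-empty list.
    static double calculateMedian(List<Integer> sorted) {
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("empty list");
        }
        int n = sorted.size();
        int mid = n / 2;
        if (n % 2 == 1) {
            // Odd length: the single middlemost value.
            return sorted.get(mid);
        }
        // Even length: average of the two middlemost values.
        return (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    public static void main(String[] args) {
        List<Integer> field = Arrays.asList(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7);
        System.out.println(calculateMedian(field)); // prints 3.0
    }
}
```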


OK, OK, even though Chris didn't mention efficiency, here's an idea for how to calculate the sample median (!) without expanding the map...

Set<Integer> sortedKeys = new TreeSet<Integer>(map.keySet()); // just to be sure the keys are sorted ;)
Double median = null;  // using Double to have an 'invalid/not found/etc' state (stays null for an empty map)
int total = 0;
for (Integer key : sortedKeys) {
  total += map.get(key);
}
if (total % 2 == 1) {
  int counter = total / 2;  // 0-based index of the middle element
  for (Integer key : sortedKeys) {
    counter -= map.get(key);
    if (counter < 0) {
      // the sample median is in the current bin
      median = key.doubleValue();
      break;
    }
  }
} else {
  int lower = total / 2 - 1;  // 0-based indices of the two middle elements
  int upper = lower + 1;
  Integer lowerKey = null;
  for (Integer key : sortedKeys) {
    lower -= map.get(key);
    upper -= map.get(key);
    if (lowerKey == null && lower < 0) {
      lowerKey = key;  // the lower middle element is in the current bin
    }
    if (upper < 0) {
      // the upper middle element is in the current bin (possibly the same bin as the lower one)
      median = (lowerKey + key) / 2.0; // now we need the average
      break;
    }
  }
}

(I have no compiler at hand - if it has too many syntax errors, treat it as pseudo code, please ;) )

扮仙女 2024-09-12 20:56:08

Using Guava:

Multiset<Integer> values = TreeMultiset.create();
Collections.addAll(values, 1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7);

Now the answer to your question is:

return Iterables.get(values, (values.size() - 1) / 2);

Really. That's it. (Or check if size is even and average the two central values, to be precise about it.)

If the counts are particularly large, it would be faster to use the multiset's entrySet and keep a running sum, but the simplest way is usually fine.

相思碎 2024-09-12 20:56:07

Linear time

If you know the total count of the numbers (in your case n = 16), you can go from the beginning or the end of the map and sum up the counts until you reach the ceil(n/2)th element if n is odd, or, if n is even, the (n/2)th and (n/2 + 1)th elements, whose average is the median.

If you don't know the total count you will have to go through all of them at least once.

Sublinear time

If you can decide on the data structure and can do pre-processing, see Wikipedia on selection algorithms; you might even get a sublinear algorithm.
You can also get sublinear time if you know something about the distribution of the data.

EDIT:
So, under the assumption that we have a sequence with counts, what we can do is:

  • while inserting the key -> count pairs, maintain another map: key -> running_total
  • this way you will have a structure in which you can get total_count by looking at the last key's running_total
  • and you will be able to do a binary search to locate the element where the running total is close to total_count/2

This will double the memory usage, but will give O(log n) performance for the median and O(1) for total_count, where n is the number of distinct keys.
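
One possible sketch of this idea is shown below. It is only an illustration under stated assumptions: the class name RunningTotalMedian is hypothetical, the key -> count pairs are assumed to arrive in ascending key order with non-negative counts, and the structure is built once rather than updated incrementally (inserting a key in the middle would require updating all later running totals).

```java
class RunningTotalMedian {
    private final int[] keys;          // distinct values, ascending
    private final long[] runningTotal; // runningTotal[i] = number of elements <= keys[i]

    // Assumes sortedKeys is ascending and counts[i] belongs to sortedKeys[i].
    RunningTotalMedian(int[] sortedKeys, int[] counts) {
        keys = sortedKeys.clone();
        runningTotal = new long[keys.length];
        long sum = 0;
        for (int i = 0; i < keys.length; i++) {
            sum += counts[i];
            runningTotal[i] = sum;
        }
    }

    // O(1): the last running total is the total count.
    long totalCount() {
        return keys.length == 0 ? 0 : runningTotal[keys.length - 1];
    }

    // O(log n): binary search for the first key whose running total reaches the 1-based rank.
    private int valueAtRank(long rank) {
        int lo = 0;
        int hi = keys.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (runningTotal[mid] >= rank) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return keys[lo];
    }

    double median() {
        long n = totalCount();
        if (n == 0) {
            throw new IllegalStateException("empty sequence");
        }
        // The two middle ranks coincide when n is odd.
        return (valueAtRank((n + 1) / 2) + valueAtRank(n / 2 + 1)) / 2.0;
    }

    public static void main(String[] args) {
        RunningTotalMedian m = new RunningTotalMedian(
                new int[] {1, 2, 3, 4, 5, 6, 7},
                new int[] {2, 4, 3, 1, 1, 3, 2});
        System.out.println(m.totalCount()); // prints 16
        System.out.println(m.median());     // prints 3.0
    }
}
```

Parallel arrays are used here instead of a second map only to keep the binary search explicit; a TreeMap keyed by running total would be another way to express the same lookup.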
