Calculating the median sales for each customer in an array of customers (newbie)
I have an array of customer objects generated from a csv file:
Date, Name, Sales
03/01, Alpha, 110
03/23, Alpha, 25
01/02, Beta, 135
...
and require an efficient way to create a new array of unique customers with median sales and export them back to csv. There could be as many as 500,000 records and 100,000 unique customers!
Comments (2)
- Split your source data into one collection per customer.
- For each customer:
-- Sort the records by sales.
-- If the record count is odd, return the sales value at the middle index.
-- If the record count is even, return the average of the two sales values on either side of the middle.
- Add the returned value to your results array.
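The steps above can be sketched in Python roughly as follows. This is only an illustration, assuming the records are already available as (date, name, sales) tuples; the function names `median_of` and `median_sales_by_customer` are made up for this example and do not come from the original post.

```python
from collections import defaultdict


def median_of(values):
    """Return the median of a non-empty list of numbers by sorting it."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd count: the middle element is the median.
        return ordered[mid]
    # Even count: average the two elements on either side of the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_sales_by_customer(records):
    """Group (date, name, sales) records by name and return {name: median}."""
    sales_by_name = defaultdict(list)
    for _date, name, sales in records:
        sales_by_name[name].append(sales)
    return {name: median_of(sales) for name, sales in sales_by_name.items()}


# Sample rows taken from the question.
records = [
    ("03/01", "Alpha", 110),
    ("03/23", "Alpha", 25),
    ("01/02", "Beta", 135),
]
print(median_sales_by_customer(records))
```

With 500,000 records spread over 100,000 customers, each per-customer list is short on average, so the sorting cost per customer should be small compared with reading the file.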
In cases like this I would use an associative array (a dictionary or hash map):
The keys are the customer names (assuming they are unique; otherwise assign a unique ID of some sort).
The values are lists of sales for each customer. After you have filled this structure, you can sort each list and take the middle element (as mentioned above).
Note that summing and dividing by the number of elements gives the mean, not the median. A comparison-based sort takes O(n log n) time, where n is the length of the list being sorted.
There are also selection algorithms that can return the kth smallest value in O(n); check the Wikipedia link below.
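A minimal sketch of this approach in Python, assuming the CSV has a header row and the three columns shown in the question. The helper names (`quickselect`, `median`, `export_medians`) and the output column name `MedianSales` are hypothetical. The selection routine is a plain randomized quickselect, which is O(n) on average rather than in the worst case.

```python
import csv
import io
import random
from collections import defaultdict


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in average O(n) time."""
    pool = list(values)
    while True:
        pivot = random.choice(pool)
        lower = [v for v in pool if v < pivot]
        equal = [v for v in pool if v == pivot]
        if k < len(lower):
            pool = lower
        elif k < len(lower) + len(equal):
            return pivot
        else:
            # Discard everything <= pivot and adjust the target rank.
            k -= len(lower) + len(equal)
            pool = [v for v in pool if v > pivot]


def median(values):
    """Median of a non-empty list via selection instead of a full sort."""
    n = len(values)
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2


def export_medians(source, target):
    """Read Date,Name,Sales rows from source; write Name,MedianSales to target."""
    sales_by_name = defaultdict(list)  # key: customer name, value: list of sales
    reader = csv.reader(source, skipinitialspace=True)
    next(reader)  # skip the header row (assumed to be present)
    for _date, name, sales in reader:
        sales_by_name[name].append(float(sales))
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(["Name", "MedianSales"])
    for name, sales in sales_by_name.items():
        writer.writerow([name, median(sales)])


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "Date, Name, Sales\n03/01, Alpha, 110\n03/23, Alpha, 25\n01/02, Beta, 135\n"
)
target = io.StringIO()
export_medians(source, target)
print(target.getvalue())
```

For real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Since each customer has only a handful of records on average, the standard library's `statistics.median`, which sorts internally, is probably just as fast in practice; the selection version mainly matters when individual lists are very long.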