Interpolating large datasets on the fly

Posted 2024-08-27 06:08:44

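I have a large dataset of about half a million records representing the exchange rate between USD and GBP over the course of a given day.

I have an application that needs to graph this data, or a subset of it. For obvious reasons, I do not want to plot half a million points on my graph.

What I need is a smaller dataset (100 points or so) that represents the given data as accurately as possible. Does anyone know of any interesting and performant ways to obtain it?

Cheers, Carl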

旧城空念 2024-09-03 06:08:45

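How about making an enumeration/iterator wrapper that collapses each frame of consecutive values into a single value? I'm not familiar with Java, but it might look something like this:

import java.util.Enumeration;

// Wraps another enumeration and yields one value per frame of frameSize
// consecutive elements. Note: despite the class name, each frame is reduced
// to its arithmetic mean, not its median.
class MedianEnumeration implements Enumeration<Double>
{
    private final Enumeration<Double> frameEnum;
    private final int frameSize;

    MedianEnumeration(Enumeration<Double> e, int len) {
        if (len < 1) {
            throw new IllegalArgumentException("len must be at least 1");
        }
        frameEnum = e;
        frameSize = len;
    }

    @Override
    public boolean hasMoreElements() {
        return frameEnum.hasMoreElements();
    }

    @Override
    public Double nextElement() {
        // Throws NoSuchElementException if the source is exhausted.
        double sum = frameEnum.nextElement();

        // The last frame may be shorter than frameSize, so count what was read.
        int count = 1;
        while (count < frameSize && frameEnum.hasMoreElements()) {
            sum += frameEnum.nextElement();
            count++;
        }

        return sum / count;
    }
}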
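
The wrapper above returns the mean of each frame. If a median per frame is preferred (it is less sensitive to isolated spikes), the same frame-by-frame idea could be sketched as follows. This is only an illustration under assumptions: the class FrameMedian and its downsample method are hypothetical names, and the data is assumed to already be in memory as a double[].

```java
import java.util.Arrays;

// Hypothetical helper, not part of the answer above: reduces an array to at
// most targetPoints values by taking the median of consecutive frames.
class FrameMedian {
    static double[] downsample(double[] values, int targetPoints) {
        if (values.length == 0 || targetPoints <= 0) {
            return new double[0];
        }
        // Round the frame size up so that no more than targetPoints frames result.
        int frameSize = (values.length + targetPoints - 1) / targetPoints;
        int frames = (values.length + frameSize - 1) / frameSize;
        double[] result = new double[frames];
        for (int f = 0; f < frames; f++) {
            int from = f * frameSize;
            int to = Math.min(from + frameSize, values.length); // last frame may be short
            double[] frame = Arrays.copyOfRange(values, from, to);
            Arrays.sort(frame);
            int mid = frame.length / 2;
            // Odd length: middle element. Even length: mean of the two middle elements.
            result[f] = (frame.length % 2 == 1)
                    ? frame[mid]
                    : (frame[mid - 1] + frame[mid]) / 2.0;
        }
        return result;
    }
}
```

With 500,000 values and targetPoints set to 100, each frame would cover 5,000 consecutive records.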

又爬满兰若 2024-09-03 06:08:44

There are several statistical methods for reducing a large dataset to a smaller one that is easier to visualize. It's not clear from your question which summary statistic you want. I've assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or in some other statistic that I'm not considering.

Summarizing a trend over time

Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):

> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17

# plot the original data
> plot(cars, main = "lowess(cars)")
# add a lowess-smoothed line using the default span (f = 2/3)
> lines(lowess(cars), col = 2)
# add a finer-grained lowess-smoothed line using a smaller span
> lines(lowess(cars, f=.2), col = 3)

The parameter f is the smoother span: the proportion of points that influence the smoothed value at each position. Larger values give a smoother curve, while smaller values follow the data more closely. Choose it with care, since you want a curve that fits your data accurately without overfitting. Rather than speed and distance, you could plot the exchange rate against time.

It's also straightforward to access the results of the smoothing. Here's how to do that:

> data = lowess( cars$speed, cars$dist )
> data
$x
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25

$y
 [1]  4.965459  4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698

The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
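
For an application that cannot call R, the core of the idea can be sketched in Java. The following is a simplified, hypothetical illustration, not R's implementation: it performs locally weighted linear regression with tricube weights but omits the robustness iterations and the delta shortcut that R's lowess applies, so its output will not match R's numerically. The class LocalRegression and its method names are made up for this example, and x is assumed to be sorted in ascending order. Evaluating the fit at a fixed number of evenly spaced positions yields the roughly 100 points the question asks for.

```java
// Hypothetical sketch, not R's implementation: locally weighted linear
// regression with tricube weights, evaluated at evenly spaced positions.
// Assumes x is sorted in ascending order and x.length == y.length >= 1.
class LocalRegression {

    // Smoothed value at x0, fitted from the k points of x nearest to x0.
    static double fitAt(double[] x, double[] y, double x0, int k) {
        // Slide a window of k points right while that brings it closer to x0.
        int lo = 0;
        int hi = k - 1;
        while (hi + 1 < x.length && Math.abs(x[hi + 1] - x0) < Math.abs(x[lo] - x0)) {
            lo++;
            hi++;
        }
        // Bandwidth: distance from x0 to the farthest point in the window.
        double h = Math.max(Math.abs(x0 - x[lo]), Math.abs(x[hi] - x0));

        // Accumulate weighted sums for the line y = a + b * (x - x0).
        double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0, plainSum = 0;
        for (int i = lo; i <= hi; i++) {
            double dx = x[i] - x0;
            double d = h > 0 ? Math.abs(dx) / h : 0;
            double t = 1 - d * d * d;
            double w = t * t * t; // tricube weight, zero at distance h
            sw += w;
            swx += w * dx;
            swy += w * y[i];
            swxx += w * dx * dx;
            swxy += w * dx * y[i];
            plainSum += y[i];
        }
        if (sw == 0) {
            return plainSum / (hi - lo + 1); // all weights vanished: plain mean
        }
        double denom = sw * swxx - swx * swx;
        if (Math.abs(denom) < 1e-12) {
            return swy / sw; // degenerate window: weighted mean
        }
        double b = (sw * swxy - swx * swy) / denom;
        return (swy - b * swx) / sw; // intercept a, the fitted value at x0
    }

    // Returns {xs, ys}: `points` evenly spaced positions and their smoothed values.
    // f plays the role of the span: the fraction of the data used per fit.
    static double[][] smooth(double[] x, double[] y, double f, int points) {
        int n = x.length;
        int k = Math.min(n, Math.max(2, (int) Math.ceil(f * n)));
        double step = points > 1 ? (x[n - 1] - x[0]) / (points - 1) : 0;
        double[][] out = new double[2][points];
        for (int j = 0; j < points; j++) {
            double x0 = x[0] + j * step;
            out[0][j] = x0;
            out[1][j] = fitAt(x, y, x0, k);
        }
        return out;
    }
}
```

A call such as LocalRegression.smooth(times, rates, 0.05, 100) would return 100 positions and their smoothed rates, where times and rates stand for your own arrays. Each fit scans its whole window, so a large span on half a million records costs correspondingly more.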

去了角落 2024-09-03 06:08:44

One option is to have the DBMS condense the data for you with an appropriate query, for example by taking the median over each time range. As a pseudo-query:

SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates 
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)

Here truncate_to_hour stands for whatever is appropriate for your DBMS. A similar approach works with any function that segments time into distinct blocks (such as rounding to the nearest 5-minute interval), or with another aggregate function in place of median if that suits the data better. Depending on the complexity of the time-segmenting procedure and how your DBMS optimizes it, it may be more efficient to first store the segmented time value in a temporary table and run the query against that.
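
If the records are already loaded in the application rather than in a database, the same grouping can be sketched in Java. This is a minimal illustration under assumptions: TimeBuckets and medianPerBucket are hypothetical names, timestamps are assumed to be epoch seconds, and the bucket width is passed in seconds (300 for 5-minute blocks, 3600 for hourly ones).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory counterpart of the pseudo-query: groups
// (timestamp, rate) pairs into fixed-width time buckets and takes the
// median rate of each bucket.
class TimeBuckets {
    // Returns bucket start time -> median rate, ordered by bucket start.
    static Map<Long, Double> medianPerBucket(long[] timestamps, double[] rates, long bucketSeconds) {
        TreeMap<Long, List<Double>> groups = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            // Truncate the timestamp down to the start of its bucket.
            long bucketStart = Math.floorDiv(timestamps[i], bucketSeconds) * bucketSeconds;
            groups.computeIfAbsent(bucketStart, key -> new ArrayList<>()).add(rates[i]);
        }
        TreeMap<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, List<Double>> entry : groups.entrySet()) {
            List<Double> group = entry.getValue();
            Collections.sort(group);
            int mid = group.size() / 2;
            // Odd size: middle element. Even size: mean of the two middle elements.
            double median = (group.size() % 2 == 1)
                    ? group.get(mid)
                    : (group.get(mid - 1) + group.get(mid)) / 2.0;
            result.put(entry.getKey(), median);
        }
        return result;
    }
}
```

The number of output points depends on the bucket width: a full day split into 15-minute buckets (900 seconds) gives at most 96 points, close to the 100 the question asks for.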
