时间序列和相关性策略

发布于 2024-07-25 20:22:55 字数 860 浏览 4 评论 0原文

我想将各种时间序列关联起来并以 csv 文件或内存数据表 (.NET) 的形式呈现。 这些时间序列是时间值对的数组(实际上这些对象不仅仅包含时间和值)。 时间序列可能跨越不同的重叠周期,有些甚至可能存在漏洞(给定时间戳的缺失值)。

对于那些感兴趣的人,我正在使用 OPC HDA .NET 库从 OPC HDA 服务器中提取历史时间序列。

生成的数据表应该为每个时间序列有一列,所有时间序列都基于时间戳列按时间顺序排列。 请参阅下面的示例:

|-------|-------|-------|-------|-------|
   TIME    TS1     TS2     TS3     TS4
|-------|-------|-------|-------|-------|
    1       X               X       X
|-------|-------|-------|-------|-------|
    2       X       X       X       X
|-------|-------|-------|-------|-------|
    3       X       X               X
|-------|-------|-------|-------|-------|
    4       X       X       X 
|-------|-------|-------|-------|-------|
    5       X       X       X 
|-------|-------|-------|-------|-------|

实现此目标的最有效方法是什么? “有效”是指用最少的代码。 但考虑到时间序列可能变得相当大,内存使用也可能是一个问题。

I have various time-series I'd like to correlate and present as either a csv-file or in-memory datatable (.NET). These time-series are arrays of time-value-pairs (actually these are objects containing more than just time and value). The time-series may span across different overlapping periods, and some might even have holes (missing values for given timestamps).

For those interested, I'm using the OPC HDA .NET library to extract historic time-series from an OPC HDA server.

The resulting datatable should have one column for each time-series all in chronological order based on a timestamp column. See example below:

|-------|-------|-------|-------|-------|
   TIME    TS1     TS2     TS3     TS4
|-------|-------|-------|-------|-------|
    1       X               X       X
|-------|-------|-------|-------|-------|
    2       X       X       X       X
|-------|-------|-------|-------|-------|
    3       X       X               X
|-------|-------|-------|-------|-------|
    4       X       X       X 
|-------|-------|-------|-------|-------|
    5       X       X       X 
|-------|-------|-------|-------|-------|

What would be the most effective way of achieving this? With "effective" I mean with the least amount of code. But considering that the timeseries could become quite large, memory usage might also be an issue.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

丑疤怪 2024-08-01 20:22:55

您可以首先扫描所有当前系列的不同值(例如,将它们聚合在 HashSet 中),然后将它们简单地转储到日期数组中(在字典中存储日期和索引位置之间的匹配项)。

var distinctDates = allSeries
  .SelectMany(s => s.Values.Select(v => v.Date))
  .Distinct()
  .OrderBy(d => d)
  .ToArray();

var datePositions = distinctDates
  .Select((d,index) => new 
    {
      Date = d,
      Index = index
    }).
  .ToDictionary(x => x.Date, x => x.Index);

然后,创建一个宽度为“NumberOfSeries”、长度为“NumberOfDates”的锯齿状数组。 之后,对所有数据进行第二次扫描并将它们转储到它们的位置。

var values = new float[allSeries.Length][];
for (var i=0;i<allSeries.Length;i++)
{
  values[i] = new float[distinctDates.Length];
  var currentSerie = allSeries[i];
  foreach(var value in currentSerie.Values)
  {
    var index = datePositions[value.Date];
    values[i][index] = value.Value;
  }      
}

我在没有接触 VisualStudio 的情况下编写了这段代码,因此可能有一些拼写错误。 或者可能使用了 .NET 中不存在的一些 LINQ 方法(只需查看 Lokad.Shared。 dll)。 但你应该能够明白这个想法。

当我谈到这个主题时,还有一些注意事项:

  1. 如果您必须将所有内容同时保留在内存中,请选择锯齿状数组。 它比字典效率更高,并且比矩形数组的内存问题要少得多。

  2. 使 Value 对象尽可能小(即:float 而不是 double)。

  3. 如果预计时间序列值的数量将来会变大,则切勿以“每个值一行”的方式将值存储在数据库中。 建议使用 HDF (具有 .NET 接口)或使用持久时间系列数据库中二进制形式的片段(如时间序列数据库

坚持使用这些应该可以让您扩展到数亿个时间值,不会出现很多问题(做到了)。

You can first scan all present series for the distinct values (for example, aggregating them in a HashSet), then simply dump them into an array of dates (storing a match between date and index position in a dictionary).

var distinctDates = allSeries
  .SelectMany(s => s.Values.Select(v => v.Date))
  .Distinct()
  .OrderBy(d => d)
  .ToArray();

var datePositions = distinctDates
  .Select((d,index) => new 
    {
      Date = d,
      Index = index
    }).
  .ToDictionary(x => x.Date, x => x.Index);

Then, create a jagged array that has width of "NumberOfSeries" and length of "NumberOfDates". After that, do a second scan of all the data and dump them to their positions.

var values = new float[allSeries.Length][];
for (var i=0;i<allSeries.Length;i++)
{
  values[i] = new float[distinctDates.Length];
  var currentSerie = allSeries[i];
  foreach(var value in currentSerie.Values)
  {
    var index = datePositions[value.Date];
    values[i][index] = value.Value;
  }      
}

I wrote this code without touching VisualStudio, so I may have a few typos. Or there may be used a few LINQ methods that are not present in the .NET (just look in Lokad.Shared.dll). But you should be able to get the idea.

Some more notes, while I'm at the topic:

  1. Go for the jagged array, if you have to keep everything in the memory at once. It is way more efficient than a dictionary and has a lot less memory problems than a rectangular array.

  2. Keep Value objects as small as possible (i.e.: float instead of double).

  3. If number of time serie values is expected to go large in the future, then never store values in database in a "one row per value". It is recommended to either go for something like HDF (which has a .NET interface) or use persist time serie fragments in binary form in DB (as in time serie databases)

Sticking to these should allow you to scale up to hundreds of millions of time values without a lot of problems (done that).

甜嗑 2024-08-01 20:22:55

您可以使用嵌套字典之类的数据结构并迭代内容:

Dictionary <TimeSeries, Dictionary<DateTime, Value>> dict = new Dictionary<TimeSeries, Dictionary<DateTime, Value>>();

foreach (TimeSeries series in dict.Keys) {

    //table row output code goes here
    Dictionary<DateTime, Value> innerDict = dict[series];
    foreach (DateTime date in innerDict.Keys) {
        Value seriesValueAtTimeT = innerDict[date];
        //table column output code goes here
    }
}

根据您的需要,您的输出代码将写入其他内容,并且您可以将数据类型 TimeSeries、Value 等替换为您的实际数据类型。

You might go with a data structure like a nested dictionary and iterate over the contents:

Dictionary <TimeSeries, Dictionary<DateTime, Value>> dict = new Dictionary<TimeSeries, Dictionary<DateTime, Value>>();

foreach (TimeSeries series in dict.Keys) {

    //table row output code goes here
    Dictionary<DateTime, Value> innerDict = dict[series];
    foreach (DateTime date in innerDict.Keys) {
        Value seriesValueAtTimeT = innerDict[date];
        //table column output code goes here
    }
}

Where your output code is writing out to something else, depending on your needs, and you replace the datatypes TimeSeries, Value, etc., with your actual data types.

瞳孔里扚悲伤 2024-08-01 20:22:55

嘿克里斯。 我知道您已经接受了答案,但我想我会发布我使用的解决方案。 希望它对某人有用。 如果不是的话,至少它为我提供了一个在将来某个时候找到它的地方。 :-)

这是直接从 Excel 2007 代码模块发出的 VBA 代码。 它可以很容易地转换为.Net。

数据操作的关键是数据透视表对象。 我发现它在将数据放入您指定的布局方面非常有效。

Sub GetIndexData ()
Dim cn as ADODB.Connection, cmd As ADODB.Command, rs As ADODB.Recordset
Dim rPivotTopLeft As Range, rPivotBottomRight As Range

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual

'Get the data.'
Set cn = New ADODB.Connection
With cn
  .Provider = "SQLOLEDB"
  .ConnectionString = "Database=" & mDBName & ";" & _
                      "Server=" & mDBServerName & ";" & _
                      "UID=" & mDBUserID & ";" & _
                      "Password=" & mDBPassword & ";" & _
                      "Persist Security Info=True;"
  .CursorLocation = adUseClient
  .Open
End With

Set cmd = New ADODB.Command
Set rs = New ADODB.Recordset
With cmd
  .ActiveConnection = adoTools.DBConnection
  .CommandType = adCmdText
  .CommandText = "SELECT YourData From YourSource WHERE YourCritera"
  Set rs = .Execute
End With



If Not (rs.BOF And rs.EOF) Then 'Check that we have some data.'

'Put the data into a worksheet.'
With wsRawData
  .Cells.CurrentRegion.Clear

  Set rPivotTopLeft = .Range("A1")
  With ThisWorkbook.PivotCaches.Add(SourceType:=xlExternal)
    Set .Recordset = rs
    .CreatePivotTable _
        TableDestination:=rPivotTopLeft, _
        TableName:="MyPivotTable"
  End With

  'Massage the data into the desired layout.'
  With .PivotTables("MyPivotTable")
    .ManualUpdate = True

    .PivotFields("Date").Orientation = xlRowField
    .PivotFields("Index").Orientation = xlColumnField
    .AddDataField .PivotFields("Return"), "Returns", xlSum

    .DisplayFieldCaptions = False
    .ColumnGrand = False
    .RowGrand = False

    .ManualUpdate = False
  End With

  mMonthCount = Range(.Range("A3"), .Cells(Rows.Count, "A").End(xlUp)).Count
  mIndexCount = Range(.Range("B2"), .Cells(2, Columns.Count).End(xlToLeft)).Count

  'Convert pivot table to values.'
  Set rPivotBottomRight = .Cells(mMonthCount + 2, mIndexCount + 1)
  With .Range(rPivotTopLeft, rPivotBottomRight)
    .Copy
    .PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False
  End With

  'Format the worksheet.'
  .Range("A3").Resize(mMonthCount, 1).NumberFormat = "mmm-yy"
  .Range("B3").Resize(mMonthCount, mIndexCount).NumberFormat = "0.00%"
  Union(.Rows(2), .Columns(1)).Font.Bold = True
  .Cells.ColumnWidth = 7.14
  .Rows(1).Delete

End With


rs.close
Set rs = Nothing
cmd.ActiveConnection = Nothing
Set cmd = Nothing
cn.close
Set cn = Nothing

End Sub

从那里可以相对容易地利用内置的 Excel 回归统计来输出相关矩阵。 通过这种技术,我可以在大约 45 秒内生成一个包含 600x600 相关矩阵的工作表。

请注意,应更改 .PivotFields 参数以适合数据源中数据的列名称。

Hey Chris. I realize you've already accepted an answer, but thought I'd post a solution I use. Hopefully it's of some use to someone. If not at least it provides a place for me to find it sometime in the future. :-)

This is VBA code fired off directly from an Excel 2007 code module. It could be easily converted to .Net.

The key for the data manipulation is the pivot table object. I've found it to be massively efficient at getting the data into the layout you've specified.

Sub GetIndexData ()
Dim cn as ADODB.Connection, cmd As ADODB.Command, rs As ADODB.Recordset
Dim rPivotTopLeft As Range, rPivotBottomRight As Range

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual

'Get the data.'
Set cn = New ADODB.Connection
With cn
  .Provider = "SQLOLEDB"
  .ConnectionString = "Database=" & mDBName & ";" & _
                      "Server=" & mDBServerName & ";" & _
                      "UID=" & mDBUserID & ";" & _
                      "Password=" & mDBPassword & ";" & _
                      "Persist Security Info=True;"
  .CursorLocation = adUseClient
  .Open
End With

Set cmd = New ADODB.Command
Set rs = New ADODB.Recordset
With cmd
  .ActiveConnection = adoTools.DBConnection
  .CommandType = adCmdText
  .CommandText = "SELECT YourData From YourSource WHERE YourCritera"
  Set rs = .Execute
End With



If Not (rs.BOF And rs.EOF) Then 'Check that we have some data.'

'Put the data into a worksheet.'
With wsRawData
  .Cells.CurrentRegion.Clear

  Set rPivotTopLeft = .Range("A1")
  With ThisWorkbook.PivotCaches.Add(SourceType:=xlExternal)
    Set .Recordset = rs
    .CreatePivotTable _
        TableDestination:=rPivotTopLeft, _
        TableName:="MyPivotTable"
  End With

  'Massage the data into the desired layout.'
  With .PivotTables("MyPivotTable")
    .ManualUpdate = True

    .PivotFields("Date").Orientation = xlRowField
    .PivotFields("Index").Orientation = xlColumnField
    .AddDataField .PivotFields("Return"), "Returns", xlSum

    .DisplayFieldCaptions = False
    .ColumnGrand = False
    .RowGrand = False

    .ManualUpdate = False
  End With

  mMonthCount = Range(.Range("A3"), .Cells(Rows.Count, "A").End(xlUp)).Count
  mIndexCount = Range(.Range("B2"), .Cells(2, Columns.Count).End(xlToLeft)).Count

  'Convert pivot table to values.'
  Set rPivotBottomRight = .Cells(mMonthCount + 2, mIndexCount + 1)
  With .Range(rPivotTopLeft, rPivotBottomRight)
    .Copy
    .PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False
  End With

  'Format the worksheet.'
  .Range("A3").Resize(mMonthCount, 1).NumberFormat = "mmm-yy"
  .Range("B3").Resize(mMonthCount, mIndexCount).NumberFormat = "0.00%"
  Union(.Rows(2), .Columns(1)).Font.Bold = True
  .Cells.ColumnWidth = 7.14
  .Rows(1).Delete

End With


rs.close
Set rs = Nothing
cmd.ActiveConnection = Nothing
Set cmd = Nothing
cn.close
Set cn = Nothing

End Sub

From there its relatively easy to leverage the built in excel regression statistics to output a correlation matrix. With this technique I'm producing a worksheet with a 600x600 matrix of correlations in about 45 seconds.

Note that the .PivotFields parameters should be changed to fit the column names of your data from your data source.

懒猫 2024-08-01 20:22:55

如果内存使用成为问题,您可以做的一件事是从跟踪单个事件转向给定时间段内事件的列表计数。 您会失去一些关于事情发生时间的精确性,但经常像这样总结数据可以降低图片的复杂性并使趋势更加明显。

如果它不明显,我的意思是获取如下所示的数据

12:00 event1
12:01 event2
12:10 event1
12:11 event1

12:00-12:15 event1 3
12:00-12:15 event2 1

One thing you can do if memory usage becomes a problem is to move from tracking single events to tabulated counts of an event within a given time period. You lose some precision about exactly when things happened, but often summarizing your data like this can reduce the complexity of your picture and make trends more obvious.

In case it's not obvious, I mean taking data that looks like this:

12:00 event1
12:01 event2
12:10 event1
12:11 event1

to this:

12:00-12:15 event1 3
12:00-12:15 event2 1
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文