Equalizing rows of data based on date/time
This is similar to another question I've asked here on SO, but it's different enough from it that I haven't been able to come up with an answer for it on my own yet. I think the best way to introduce my problem is with a picture:
I have several text files (4 in this example), each with millions of lines of data in the following format:
TIME DATA
File #1
104500 4098
104501 34098
104502 1321
104502 3408
104503 4587
104503 1204
104503 49858
104504 1029
104505 4058
104506 7576
File #2
104500 23408
104500 2131
104501 5686
104502 6839
104502 21838
104503 86760
104503 20812
104503 85719
104504 4877
104505 2220
104506 4847
File #3
104500 23042
104501 12391
104501 5857
104501 6979
104502 2196
104502 21039
104503 9263
104503 50573
104503 18361
104504 17545
104505 67612
104506 21075
File #4
104500 1193
104501 8664
104502 1028
104502 68561
104503 69178
104503 1230
104503 12048
104504 8843
104505 9910
104506 53978
104506 13722
The problem is that a given time in one file may have more or fewer data entries than it has in another file. In the picture above, for example, there is only one entry for 10:45:00 in File #1, but there are two entries for 10:45:00 in File #2. I'm hoping to get each file to have the same number of lines per time entry, so in my example with Files #1 and #2, a 'filler' line would be added after the first '104500 4098' line in File #1, and this filler line would just be an exact copy of the line above it (104500 4098 in this case). Ideally these 'filler' lines would be inserted into the text files being read from, and not written to a new text file.
What I've come up with so far is that I need to:
--count the number of lines for each given time
--find which file has the highest number of lines for each given time
--insert the 'filler' line(s) in each file where necessary
Unfortunately I don't really know how to do any of that. I have some ideas, but they're all vague at this point so I don't really know what I should read up on yet. The only real code that I've come up with so far is that I can assign all files in the directory to an array using Directory.GetFiles, and I can then loop through all the files that way, but that doesn't get me very far.
These lines of data are generated by a program, which then writes the lines to text files. I don't have access to the code which generates the lines of data.
If anyone has any ideas as to how I might accomplish this, a hint would be much appreciated.
Comments (2)
Let us distill this situation down to only two time stamps and I will provide the answer.
Below I have recreated three files. Each of the file buffers has timestamps of 104500 and 104501, while the second file has two 501s, which is the problem being addressed. That means that file1 and file3 only have one 501 each. Then I simulate parsing the data from the files and project it into a holder class which has a file ID, the data and the timestamp. Once all the data is acquired for each file buffer, I union the data. With the data in one IEnumerable list, I then group by time; this grouping is the key to the eventual processing.
Now all you have to do is extract the unit of time that is of interest and do the calculations on that set, keeping in mind the missing data for file1 and file3. You could then manipulate the grouped result to either add more entries for the missing ones or just reuse the last value.
Answer: Regardless, don't work within a file, put the data in memory and adjust to the missing data when you do the calculation.
Here is what the data looks like; see how it is conveniently grouped by time (the key), 104500 and 104501. One just extracts the target time from that grouping, with all the values from files 1 to 3, and does the calculation.
Here is the code to get it organized (the Dump method is from LINQPad; it displays the data as shown in the picture)
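As a minimal sketch of the organization described above (this is not the original listing): each file buffer is parsed into a small holder type, the buffers are merged, and the merged entries are grouped by timestamp. The names `Entry`, `Grouping`, `Parse` and `GroupByTime` are assumptions made for illustration, and the sample values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for one parsed line: file ID, timestamp and data value.
public record Entry(int FileId, string Time, int Data);

public static class Grouping
{
    // Parse the "TIME DATA" lines of one file buffer into entries.
    public static IEnumerable<Entry> Parse(int fileId, string buffer) =>
        buffer.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
              .Select(line => line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
              .Select(parts => new Entry(fileId, parts[0], int.Parse(parts[1])));

    // Merge all buffers (file IDs start at 1) and group the entries by timestamp.
    public static ILookup<string, Entry> GroupByTime(params string[] buffers) =>
        buffers.SelectMany((buffer, i) => Parse(i + 1, buffer))
               .ToLookup(e => e.Time);
}
```

With three buffers where only the second has two 104501 lines, `GroupByTime(...)["104501"]` holds four entries, two of them with `FileId == 2`, so the short files are easy to spot per timestamp.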
UPDATE: Extract at Timeslice
Below is code to extract an indexed value within a target time. I call this a timeslice. When one asks for a timeslice, the code has to be smart enough to return the last value as a default when the index (timeslice) asked for is out of range.
For example, file 1 has one item; if I were to ask for timeslice index two, it should retrieve the last value, which is also the first. If I ask for index 100, it should return that value as well.
So let us look at the time 104501 and get that data. Then we will group by the file ID,
and our data looks like this for ds3:
Now we need to create a method which will handle the extraction of a timeslice and handle missing index (slice) values. To do that I will use DefaultIfEmpty to specify that the last value of the file is the default if we ask for too many. Here is that code:
Then if we look at file 2 and ask for its time slices, including a third (or even a fourth and beyond) which does not exist, we expect 2, 4, 4, 4 as the resulting values. Here are the calls against ds3 above:
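A minimal sketch of such an extraction, assuming a zero-based slice index and a non-empty list of values for the file at the target time. `TimeSlice` and `ValueAt` are hypothetical names; `Skip`, `DefaultIfEmpty` and `First` are the standard LINQ operators.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TimeSlice
{
    // Return the value at the zero-based slice index, falling back to the
    // last value when the index runs past the end of the list.
    // Assumption: values is non-empty (Last() throws on an empty list).
    public static int ValueAt(IReadOnlyList<int> values, int slice) =>
        values.Skip(slice).DefaultIfEmpty(values.Last()).First();
}
```

For a file whose values at 104501 are 2 and 4, slices 0 through 3 give 2, 4, 4, 4, which matches the values expected above; a file with a single value returns that value for every index.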
This is not going to be simple. For starters, you can't just insert a line in a text file. You need to copy the file to a new file, inserting the line required in the process. You can then delete the old file and rename the new file to take its place.
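The copy, delete and rename steps can be sketched roughly as follows. `FileRewrite`, `Rewrite` and the `copies` callback are hypothetical names, and the `.tmp` suffix for the temporary file is an assumption; the file calls themselves (`File.ReadLines`, `StreamWriter`, `File.Delete`, `File.Move`) are standard .NET.

```csharp
using System;
using System.IO;

public static class FileRewrite
{
    // Copy path to a temporary file, writing each line as many times as the
    // hypothetical copies callback says (1 leaves the line unchanged), then
    // put the temporary file in place of the original.
    public static void Rewrite(string path, Func<string, int> copies)
    {
        string tempPath = path + ".tmp"; // assumed naming for the new file
        using (var writer = new StreamWriter(tempPath))
        {
            // File.ReadLines streams the file instead of loading it whole.
            foreach (string line in File.ReadLines(path))
            {
                int n = copies(line);
                for (int i = 0; i < n; i++) writer.WriteLine(line);
            }
        }
        File.Delete(path);         // delete the old file
        File.Move(tempPath, path); // rename the new file to take its place
    }
}
```

For example, `FileRewrite.Rewrite(path, line => line.StartsWith("104500") ? 2 : 1)` would duplicate every 104500 line and leave the rest untouched.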
I'm assuming, too, that you don't know which file will need lines added before you process them all. This means that you either need to load all the files into memory, process them there, and write out the result, or open a stream on each file plus a new file for each, and process the data from the old stream to the new stream for each file, inserting lines as required.
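One way the streaming variant might look is a two-pass sketch: the first pass counts lines per timestamp in every file and keeps the maximum, and the second pass rewrites each file, repeating the last line of each timestamp run until it reaches that maximum. This assumes the lines in each file are already ordered by time and that the timestamp is the first space-separated field. `Equalizer` and its members are hypothetical names. A timestamp that is missing from a file entirely gets no filler, since there is no line to copy.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class Equalizer
{
    // Assumption: the timestamp is the first space-separated field of a line.
    static string TimeOf(string line) => line.Split(' ')[0];

    // Pass 1: for each timestamp, the largest line count found in any single file.
    public static Dictionary<string, int> MaxCounts(IEnumerable<string> paths)
    {
        var max = new Dictionary<string, int>();
        foreach (string path in paths)
        {
            var counts = new Dictionary<string, int>();
            foreach (string line in File.ReadLines(path))
            {
                string time = TimeOf(line);
                counts[time] = counts.TryGetValue(time, out int c) ? c + 1 : 1;
            }
            foreach (var pair in counts)
            {
                if (!max.TryGetValue(pair.Key, out int m) || pair.Value > m)
                    max[pair.Key] = pair.Value;
            }
        }
        return max;
    }

    // Pass 2: stream one file into a temporary file, padding each run of equal
    // timestamps with copies of its last line, then replace the original.
    public static void Pad(string path, IReadOnlyDictionary<string, int> max)
    {
        string tempPath = path + ".tmp";
        using (var writer = new StreamWriter(tempPath))
        {
            string lastLine = null;
            int run = 0;
            foreach (string line in File.ReadLines(path))
            {
                if (lastLine != null && TimeOf(line) != TimeOf(lastLine))
                {
                    Fill(writer, lastLine, run, max);
                    run = 0;
                }
                writer.WriteLine(line);
                lastLine = line;
                run++;
            }
            if (lastLine != null) Fill(writer, lastLine, run, max);
        }
        File.Delete(path);
        File.Move(tempPath, path);
    }

    // Write copies of line until the run reaches the maximum for its timestamp.
    static void Fill(TextWriter writer, string line, int run, IReadOnlyDictionary<string, int> max)
    {
        int target = max.TryGetValue(TimeOf(line), out int m) ? m : run;
        for (int i = run; i < target; i++) writer.WriteLine(line);
    }

    public static void Equalize(IEnumerable<string> paths)
    {
        var files = paths.ToList();
        var max = MaxCounts(files);
        foreach (string path in files) Pad(path, max);
    }
}
```

Only the per-timestamp counts are held in memory, not the lines themselves, which matters with millions of lines per file. Something like `Equalizer.Equalize(Directory.GetFiles(folder))` would process a whole directory, where `folder` is a placeholder for wherever the files live.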