Saving real-time data to 1000 files

Posted on 2024-07-21 07:35:10


I have a program that receives real-time data on 1000 topics. It receives -- on average -- 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.

I'm using 32-bit Windows XP on 'Core 2' hardware and programming in C#.

I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.

I've considered a few approaches:

1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.

2) Write into one file and -- somehow -- process it later to produce 1000 files.

3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough RAM, although I might need to move to 64-bit to get over the 2 GB limit.

How would you approach this problem?
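For reference, option 1 amounts to something like the following minimal sketch; the class and its naming are illustrative, and it assumes topic names are valid file names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch of option 1: keep one StreamWriter open per topic and
// append each message as it arrives. Assumes topic names are valid file names.
class PerTopicWriters : IDisposable
{
    private readonly Dictionary<string, StreamWriter> _writers =
        new Dictionary<string, StreamWriter>();
    private readonly string _directory;

    public PerTopicWriters(string directory)
    {
        _directory = directory;
        Directory.CreateDirectory(directory);
    }

    // Called for every incoming message; opens the topic's file on first use.
    public void Write(string topic, string message)
    {
        StreamWriter writer;
        if (!_writers.TryGetValue(topic, out writer))
        {
            writer = new StreamWriter(Path.Combine(_directory, topic + ".log"), true); // append mode
            _writers[topic] = writer;
        }
        writer.WriteLine("{0:o}\t{1}", DateTime.UtcNow, message);
    }

    public void Dispose()
    {
        foreach (StreamWriter writer in _writers.Values)
            writer.Dispose();
    }
}
```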


Comments (11)

情域 2024-07-28 07:35:10


I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.

If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.

Seriously. Database it.

Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.

Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.

帅冕 2024-07-28 07:35:10


Like n8wrl, I would also recommend a DB. But if you really dislike that option ...

Let's find another solution ;-)

As a minimal first step, I would use two threads. The first is a worker that receives all the data and puts each object (timestamp, two strings) into a queue.

Another thread checks this queue (perhaps signalled by an event, or by checking the Count property). This thread dequeues each object, opens the specific file, writes it out, closes the file and proceeds to the next item.

I would start with this first approach and take a look at the performance. If it sucks, do some measuring to find where the problem is and try to fix it (put the open files into a dictionary of (name, StreamWriter), etc.).

But on the other hand, a DB would suit this problem so well...
One table, four columns (id, timestamp, topic, message), one additional index on topic, done.
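A minimal sketch of that two-thread layout, assuming a plain lock-protected Queue&lt;T&gt; plus an AutoResetEvent for signalling; the Message class and other names are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Sketch of the two-thread idea: the receiving thread enqueues, a dedicated
// writer thread drains the queue and appends to the topic's file.
class Message
{
    public DateTime Timestamp;
    public string Topic;
    public string Value;
}

class QueueingLogger
{
    private readonly Queue<Message> _queue = new Queue<Message>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    // Called by the receiving (worker) thread for every incoming message.
    public void Enqueue(string topic, string value)
    {
        var msg = new Message { Timestamp = DateTime.UtcNow, Topic = topic, Value = value };
        lock (_queue) _queue.Enqueue(msg);
        _signal.Set();
    }

    // Run this on the second thread, e.g. new Thread(logger.WriterLoop).Start().
    public void WriterLoop()
    {
        while (true)
        {
            _signal.WaitOne();
            while (true)
            {
                Message msg;
                lock (_queue)
                {
                    if (_queue.Count == 0) break;
                    msg = _queue.Dequeue();
                }
                // Open, write, close, exactly as described above; caching the
                // open StreamWriters in a dictionary would avoid this churn.
                using (var writer = new StreamWriter(msg.Topic + ".log", true))
                    writer.WriteLine("{0:o}\t{1}", msg.Timestamp, msg.Value);
            }
        }
    }
}
```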

两仪 2024-07-28 07:35:10


First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec. Each minute, 600 MB. Well, you could drop that in RAM. Then flush each hour.

Edit: corrected mistake. Sorry, my bad.

等风来 2024-07-28 07:35:10


I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
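A rough sketch of the per-topic queues with a write threshold; the BatchingWriter name and the threshold of 100 are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Sketch of the per-topic queue with a write threshold: the writer thread only
// flushes a topic once it has accumulated BatchThreshold lines, so each file
// open/close pays for a whole batch instead of one or two messages.
class BatchingWriter
{
    private const int BatchThreshold = 100;   // illustrative value, tune as needed
    private readonly Dictionary<string, Queue<string>> _queues =
        new Dictionary<string, Queue<string>>();

    // Receiving thread: timestamp the message and put it in its topic's queue.
    public void Enqueue(string topic, string value)
    {
        string line = string.Format("{0:o}\t{1}", DateTime.UtcNow, value);
        lock (_queues)
        {
            Queue<string> q;
            if (!_queues.TryGetValue(topic, out q))
                _queues[topic] = q = new Queue<string>();
            q.Enqueue(line);
        }
    }

    // Writer thread: call this in a loop; pass force = true at shutdown.
    public void FlushPass(bool force)
    {
        List<string> topics;
        lock (_queues) topics = new List<string>(_queues.Keys);

        foreach (string topic in topics)
        {
            string[] batch = null;
            lock (_queues)
            {
                Queue<string> q = _queues[topic];
                if (force || q.Count >= BatchThreshold)
                {
                    batch = q.ToArray();
                    q.Clear();
                }
            }
            if (batch == null) continue;

            using (var writer = new StreamWriter(topic + ".log", true))
                foreach (string line in batch)
                    writer.WriteLine(line);
        }
    }
}
```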

七颜 2024-07-28 07:35:10


I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...

  1. 1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.
  2. This is close to db-ish-ness! Also sounds like more trouble than it's worth.
  3. RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.

How would I approach this? DB! Because then I can query, index, analyze, etc. etc.

:)

旧时光的容颜 2024-07-28 07:35:10


I would agree with Kyle and go with a package product like PI. Be aware PI is quite expensive.

If you're looking for a custom solution, I'd go with Stephen's, with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file, though, to hand the messages off to the other process, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure of its speed.

I would also recommend using a DB to store your data. You'll want to do bulk inserts of data into the DB, though, as I think you would need some hefty hardware to allow SQL to do 5000 transactions a second. You're better off doing a bulk insert of, say, every 10,000 messages that accumulate in the queue.

DATA SIZES:

Average message ~50 bytes:
small datetime = 4 bytes + topic (~10 non-Unicode characters) = 10 bytes + message (~31 non-Unicode characters) = 31 bytes.

50 * 5000 = 244 KB/sec -> 14 MB/min -> 858 MB/hour
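To illustrate the batched insert, a hedged sketch using SqlBulkCopy; the TopicLog table, its columns and the connection string are assumptions, and the 10,000-row batch mirrors the suggestion above:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

// Sketch: accumulate rows in a DataTable and bulk-copy them once the batch
// reaches ~10,000 rows. Table name, columns and connection string are assumed.
// Intended to be called from the single queue-draining thread.
class BulkLogger
{
    private const int BatchSize = 10000;
    private readonly string _connectionString;
    private readonly DataTable _batch;

    public BulkLogger(string connectionString)
    {
        _connectionString = connectionString;
        _batch = new DataTable();
        _batch.Columns.Add("Timestamp", typeof(DateTime));
        _batch.Columns.Add("Topic", typeof(string));
        _batch.Columns.Add("Message", typeof(string));
    }

    public void Add(DateTime timestamp, string topic, string message)
    {
        _batch.Rows.Add(timestamp, topic, message);
        if (_batch.Rows.Count >= BatchSize) Flush();
    }

    public void Flush()
    {
        if (_batch.Rows.Count == 0) return;
        using (var connection = new SqlConnection(_connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.TopicLog";        // assumed table name
                bulk.ColumnMappings.Add("Timestamp", "Timestamp"); // map by name so an
                bulk.ColumnMappings.Add("Topic", "Topic");         // identity id column
                bulk.ColumnMappings.Add("Message", "Message");     // doesn't shift ordinals
                bulk.WriteToServer(_batch);
            }
        }
        _batch.Rows.Clear();
    }
}
```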

心碎无痕… 2024-07-28 07:35:10


Perhaps you don't want the overhead of a DB install?

In that case, you could try a filesystem-based database like sqlite:

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
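For illustration, writing the messages through the System.Data.SQLite ADO.NET provider might look roughly like this; the table layout and class names are assumptions, and batching many inserts into one transaction avoids paying a disk sync per row:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SQLite;   // System.Data.SQLite ADO.NET provider

// Sketch: one table for everything; inserts are wrapped in a transaction so
// thousands of rows per second don't each pay for their own disk sync.
class SqliteLogger : IDisposable
{
    public class Row { public DateTime Ts; public string Topic; public string Message; }

    private readonly SQLiteConnection _connection;

    public SqliteLogger(string path)
    {
        _connection = new SQLiteConnection("Data Source=" + path);
        _connection.Open();
        using (var cmd = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS log (ts TEXT, topic TEXT, message TEXT)", _connection))
            cmd.ExecuteNonQuery();
    }

    // Write a buffered batch of messages inside a single transaction.
    public void WriteBatch(IEnumerable<Row> rows)
    {
        using (var tx = _connection.BeginTransaction())
        using (var cmd = new SQLiteCommand(
            "INSERT INTO log (ts, topic, message) VALUES (@ts, @topic, @message)", _connection))
        {
            cmd.Transaction = tx;
            foreach (Row r in rows)
            {
                cmd.Parameters.Clear();
                cmd.Parameters.AddWithValue("@ts", r.Ts.ToString("o"));
                cmd.Parameters.AddWithValue("@topic", r.Topic);
                cmd.Parameters.AddWithValue("@message", r.Message);
                cmd.ExecuteNonQuery();
            }
            tx.Commit();
        }
    }

    public void Dispose() { _connection.Dispose(); }
}
```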

溇涏 2024-07-28 07:35:10


I would make two separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out. Doing things this way allows you to minimize the number of file handles open while still handling the incoming requests in real time. If you make the first program format its output correctly, then processing it into the individual files should be simple.
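A sketch of the second program, under the assumption that the first one writes tab-separated "timestamp, topic, message" lines; the file names and format are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Splits the single combined file (one "timestamp<TAB>topic<TAB>message" line
// per message, as the first program wrote it) into one file per topic.
class Splitter
{
    static void Main(string[] args)
    {
        string combinedFile = args.Length > 0 ? args[0] : "combined.log";
        var writers = new Dictionary<string, StreamWriter>();

        using (var reader = new StreamReader(combinedFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(new[] { '\t' }, 3);
                if (parts.Length != 3) continue;            // skip malformed lines

                StreamWriter writer;
                if (!writers.TryGetValue(parts[1], out writer))
                    writers[parts[1]] = writer = new StreamWriter(parts[1] + ".log", true);
                writer.WriteLine(parts[0] + "\t" + parts[2]);   // timestamp + message
            }
        }

        foreach (StreamWriter writer in writers.Values)
            writer.Dispose();
    }
}
```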

风和你 2024-07-28 07:35:10


I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
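That could be a per-topic buffer that the flushing thread swaps out under a lock; a minimal sketch, where the 10-second interval and the names are arbitrary illustrations:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Sketch: messages accumulate in a per-topic buffer; a timer callback swaps the
// buffer out under a lock and writes each topic's lines to its file.
class PeriodicFlusher
{
    private Dictionary<string, List<string>> _buffer = new Dictionary<string, List<string>>();
    private readonly object _gate = new object();
    private readonly Timer _timer;

    public PeriodicFlusher()
    {
        // Flush every 10 seconds on a thread-pool thread (interval is illustrative).
        _timer = new Timer(state => Flush(), null, 10000, 10000);
    }

    // Called by the receiving thread.
    public void Add(string topic, string message)
    {
        string line = string.Format("{0:o}\t{1}", DateTime.UtcNow, message);
        lock (_gate)
        {
            List<string> lines;
            if (!_buffer.TryGetValue(topic, out lines))
                _buffer[topic] = lines = new List<string>();
            lines.Add(line);
        }
    }

    private void Flush()
    {
        Dictionary<string, List<string>> toWrite;
        lock (_gate)
        {
            toWrite = _buffer;
            _buffer = new Dictionary<string, List<string>>();   // swap so the lock stays short
        }
        foreach (KeyValuePair<string, List<string>> pair in toWrite)
            using (var writer = new StreamWriter(pair.Key + ".log", true))
                foreach (string line in pair.Value)
                    writer.WriteLine(line);
    }
}
```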

月亮坠入山谷 2024-07-28 07:35:10


If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file (append operations are as fast as they can be) and use a separate process/service to split the file up into the 1000 files. You could even roll over the file every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the previous one up into 1000 separate files.
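A tiny sketch of the roll-over part: derive the active file name from the clock, so a new combined file starts every 15 minutes and the splitter can safely pick up the files that are no longer being written (the name pattern is an assumption):

```csharp
using System;
using System.IO;

// Sketch of the roll-over: the active file name changes every 15 minutes.
static class RollingLog
{
    // E.g. "feed-20240721-0915.log" for any time between 09:15 and 09:29.
    static string CurrentFileName(DateTime now)
    {
        int quarter = (now.Minute / 15) * 15;
        return string.Format("feed-{0:yyyyMMdd}-{1:D2}{2:D2}.log", now, now.Hour, quarter);
    }

    // Append one "timestamp<TAB>topic<TAB>message" line to whichever file is current.
    public static void Append(string topic, string message)
    {
        string line = string.Format("{0:o}\t{1}\t{2}", DateTime.UtcNow, topic, message);
        File.AppendAllText(CurrentFileName(DateTime.UtcNow), line + Environment.NewLine);
    }
}
```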

All this does beg the question of why not a DB, and why you need 1000 different files - you may have a very good reason - but then again, perhaps you should re-think your strategy and make sure the reasoning is sound before you go too far down this path.

如果没有你 2024-07-28 07:35:10


I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried things like this with files and an MS SQL database before and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs, and they even have packages where you can query the data just as if it were SQL.

It wouldn't allow me to post hyperlinks, so just google those two products and you will find information on them.

EDIT

If you do use a database like most people are suggesting I would recommend a table for each topic for historical data and consider table partitioning, indexes, and how long you are going to store the data.

For example, if you are going to store a day's worth of data and it's one table for each topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432,000 records per table per day. After exporting the data, I would imagine you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then, if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock (online index rebuilds require MS SQL Enterprise Edition). If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.

Basically, what I'm saying is: weigh the cost of purchasing a reliable product against the cost of building your own.
