Twitter Streaming API recording and processing using Windows Azure and F#

Posted 2024-09-19 03:12:02

A month ago I tried using F# agents to process and record Twitter Streaming API data. As a small exercise, I am now trying to port that code to Windows Azure.

So far I have two roles:

  • One worker role (Publisher) that puts messages on a queue (each message is the JSON of a tweet).

  • One worker role (Processor) that reads messages from the queue, decodes the JSON and dumps the data into a cloud table.
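
As a rough illustration of this layout, here is a minimal in-memory sketch, written in Python for brevity. It assumes nothing about the Azure SDK: `queue.Queue` stands in for the Azure queue, a plain dict stands in for the cloud table, and the function names and tweet fields (`id`, `text`) are hypothetical.

```python
import json
import queue


def publisher(tweets, q):
    """Publisher role: put each tweet on the queue as a JSON string."""
    for tweet in tweets:
        q.put(json.dumps(tweet))


def processor(q, table):
    """Processor role: drain the queue, decode the JSON, store each row."""
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            break
        tweet = json.loads(message)
        table[tweet["id"]] = tweet["text"]


tweet_queue = queue.Queue()  # stand-in for the Azure queue (assumption)
tweet_table = {}             # stand-in for the cloud table (assumption)

publisher([{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}], tweet_queue)
processor(tweet_queue, tweet_table)
```

The two roles share nothing except the queue, which is what lets each one be scaled or restarted independently.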

Which leads to lots of questions:

  • Is it okay to think of a worker role as an agent?
  • In practice a message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
  • Is it correct to say that, if needed, I can increase the number of instances of the Processor worker role and the queue will magically be processed faster?

Sorry for piling on all these questions; I hope you don't mind.

Thanks a lot!

Comments (3)

执手闯天涯 2024-09-26 03:12:02

There is an open-source library named Lokad.Cloud that handles large messages transparently. You can find it at http://code.google.com/p/lokad-cloud/

薯片软お妹 2024-09-26 03:12:02

Is it okay to think of a worker role as an agent?

Yes, definitely.

In practice a message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?

Yes. The technique you describe (saving the JSON to blob storage under a name such as "JSONMessage-1" and then sending a queue message whose content is "JSONMessage-1") seems to be the standard way of passing messages larger than 8 KB in Azure. It will be slower, because the consumer makes 4 calls to Azure storage rather than 2 (1 to get the queue message, 1 to get the blob contents, 1 to delete the message from the queue, 1 to delete the blob). Will it be noticeably slower? Probably not.
If a good number of messages will be smaller than 8 KB when Base64 encoded (this is a gotcha in the StorageClient library), you can add some logic to decide how to send each one.
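
The "logic to decide how to send" could look something like the sketch below, in Python for brevity. It is only an illustration under stated assumptions: a list stands in for the queue, a dict for blob storage, the 8 KB limit is applied to the Base64-encoded size as described above, and every name (`fits_in_queue`, `send`, `receive`, the `JSONMessage-` prefix with a random suffix) is hypothetical rather than part of any Azure library.

```python
import base64
import uuid

MAX_QUEUE_BYTES = 8 * 1024  # the 8 KB limit discussed above


def fits_in_queue(payload: str) -> bool:
    """Compare the Base64-encoded size against the limit."""
    encoded = base64.b64encode(payload.encode("utf-8"))
    return len(encoded) <= MAX_QUEUE_BYTES


def send(payload, queue_messages, blobs):
    """Send small payloads inline; park large ones in a blob and send its name."""
    if fits_in_queue(payload):
        queue_messages.append(("inline", payload))
    else:
        blob_name = f"JSONMessage-{uuid.uuid4()}"
        blobs[blob_name] = payload
        queue_messages.append(("blob", blob_name))


def receive(queue_messages, blobs):
    """Take the next message, resolving a blob reference if there is one."""
    kind, body = queue_messages.pop(0)
    if kind == "inline":
        return body
    return blobs.pop(body)  # fetch the blob contents and delete the blob
```

With this shape, only the oversized messages pay for the two extra storage calls.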

Is it correct to say that, if needed, I can increase the number of instances of the Processor worker role and the queue will magically be processed faster?

As long as you've written your worker role so that it's self-contained and the instances don't get in each other's way, then yes, increasing the instance count will increase the throughput.
If your role is mainly just reading from and writing to storage, you might benefit from multi-threading the worker role first, before increasing the instance count, which will save money.
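
A minimal sketch of multi-threading inside a single worker, in Python for brevity. It assumes a thread-safe in-memory `queue.Queue` in place of the Azure queue, and `drain`, `run_worker` and `handle` are hypothetical names; a real handler would write to storage instead of appending to a list.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def drain(q, handle):
    """One worker thread: take messages until the queue is empty."""
    count = 0
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            return count
        handle(message)
        count += 1


def run_worker(q, handle, threads=4):
    """Run several draining threads inside a single worker instance."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(drain, q, handle) for _ in range(threads)]
        return sum(f.result() for f in futures)


work = queue.Queue()
for i in range(20):
    work.put(i)

seen = []
lock = threading.Lock()


def handle(message):
    # Hypothetical handler: record the message under a lock.
    with lock:
        seen.append(message)


processed = run_worker(work, handle)
```

Each message is taken by exactly one thread, so the threads do not interfere with each other, which is the same property the instances need when scaling out.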

皓月长歌 2024-09-26 03:12:02

Is it okay to think of a worker role as an agent?

This is the perfect way to think of it. Imagine the workers at McDonald's. Each worker has certain tasks and they communicate with each other via messages (spoken).

In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance?

As long as the message is immutable, this is the best way to do it. Strings can be very large and are therefore allocated on the heap. Since they are immutable, passing references around is not an issue.

Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?

You need to look at what your process is doing and decide whether it is IO bound or CPU bound. Typically, IO-bound processes gain performance when you add more agents. If you are using the ThreadPool for your agents, the work will be balanced quite well even for CPU-bound processes, but you will hit a limit. That being said, don't be afraid to experiment with your architecture and MEASURE the results of each run. This is the best way to find the right number of agents to use.
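
One way to do that measuring is sketched below, in Python for brevity. It is an illustration only: `time.sleep` stands in for a storage round trip, so the workload is IO bound by construction, and `fake_io_task` and `measure` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(_):
    time.sleep(0.01)  # stand-in for a storage round trip (assumption)
    return 1


def measure(agent_count, task_count=40):
    """Run the same workload with a given number of agents and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=agent_count) as pool:
        done = sum(pool.map(fake_io_task, range(task_count)))
    return done, time.perf_counter() - start


results = {n: measure(n) for n in (1, 2, 4, 8)}
for n, (done, elapsed) in results.items():
    print(f"{n} agents: {done} tasks in {elapsed:.3f}s")
```

Swapping the sleep for the real processing step and comparing the timings shows where adding agents stops helping.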
