How can I download faster?

Posted 12-14 00:06 · 643 words · 6 views · 0 comments


What is the fastest way to download webpage source into a memo component? I use Indy and HttpCli components.

The problem is that I have a listbox filled with more than 100 sites; my program downloads each page's source into a memo and parses that source for mp3 files. It is something like a Google music search program; it uses Google queries to make Google searching easier.

I started reading about threads, which led to my question: can I create an IdHttp instance in a thread with a parsing function and tell it to parse half of the sites in the listbox?

So basically when a user clicks parse, the main thread should do:

for i := 0 to listbox1.items.count div 2 - 1 do
    get and parse

and the other thread should do:

for i := form1.listbox1.items.count div 2 to form1.listbox1.items.count - 1 do
    get and parse.

so they would add parsed content to form1.listbox2 at the same time. Or would it be easier to start two IdHttp instances in the main thread, one for the first half of the sites and the other for the second?

For this: should I use Indy or Synapse?


Comments (3)

享受孤独 · 2024-12-21 00:06:16


I would create a thread that can read a single URL and process its content. You can then decide how many of those threads you want to fire at the same time. Your computer will allow quite a number of connections, so if those 100 sites have different hostnames, running 10 or 20 at the same time is not a problem. Too many is overkill, but too few wastes processor time.

You can tweak this process even further by having separate threads for downloading and processing, so that you can have a number of threads constantly downloading content. Downloading is not very processor intensive. It is basically waiting for a response, so you can easily have a relatively large number of download threads, while a couple of other worker threads can grab items from the pool of results and process them.
But splitting downloading and processing will make it a little bit more complex, and I don't think you're up to that challenge yet.

But currently, you have some other problems. First, it is not safe to use VCL components from a thread. If you need information from a listbox in a thread, you either need to use Synchronize in the thread to make a 'safe' call to the main thread, or you have to pass the information needed before you start the thread. The latter is more efficient, because code executed using Synchronize actually runs in the main thread, making your multi-threading less efficient.

But my attention was actually drawn to the first line, "download webpage source into a memo component". Don't do that! Don't load those results into a memo for processing. Automatic processing is best done in memory, outside of visual controls. Using strings, streams, or even stringlists to process text is way faster than using a memo.
A stringlist has some overhead as well, but it uses the same indexed-lines structure (TMemoStrings, which is the Lines property of a memo, and TStringList have the same ancestor), so if you have code that makes use of this, it will be quite easy to convert it to TStringList.
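Putting these suggestions together (pass the URL in before the thread starts, download into a plain string rather than a memo, and only touch the VCL through Synchronize), a minimal sketch of such a worker thread could look like this. TDownloadThread and ReportResult are hypothetical names; Form1/ListBox2 are the controls from the question, and the Indy unit names assume Indy 10:

```pascal
uses
  Classes, SysUtils, IdHTTP;

type
  TDownloadThread = class(TThread)
  private
    FUrl: string;      // passed in before the thread starts, so no VCL access is needed
    FResult: string;
  protected
    procedure Execute; override;
    procedure ReportResult;  // runs in the main thread via Synchronize
  public
    constructor Create(const AUrl: string);
  end;

constructor TDownloadThread.Create(const AUrl: string);
begin
  inherited Create(True);  // create suspended so the fields are set before Execute runs
  FreeOnTerminate := True;
  FUrl := AUrl;
end;

procedure TDownloadThread.Execute;
var
  Http: TIdHTTP;
  Html: string;
begin
  Http := TIdHTTP.Create(nil);
  try
    Html := Http.Get(FUrl);      // download into a plain string, not a memo
  finally
    Http.Free;
  end;
  // ... parse Html for mp3 links here, still inside the worker thread,
  // and store whatever you want to display in FResult ...
  FResult := FUrl;
  Synchronize(ReportResult);     // only the UI update touches the main thread
end;

procedure TDownloadThread.ReportResult;
begin
  Form1.ListBox2.Items.Add(FResult);  // runs in the main thread
end;
```

The caller would then do something like `TDownloadThread.Create(Url).Start;` (or `Resume` in older Delphi versions) for each URL it wants fetched.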

无法回应 · 2024-12-21 00:06:16


I would suggest doing ALL of the parsing in threads, don't have the main thread do any parsing at all. The main thread should only manage the UI. Don't parse the HTML from a TMemo, have each thread download to a TStream or String and then parse from that directly. Use TIdSync or TIdNotify to send parsing results to the UI for display (if speed is important, use TIdNotify). Involving the UI components in your parsing logic will slow it down.
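The TIdNotify approach mentioned above could be sketched like this, assuming Indy 10's IdSync unit; TResultNotify and AddLine are hypothetical names, and Form1/ListBox2 come from the question:

```pascal
uses
  Classes, IdSync;

type
  TResultNotify = class(TIdNotify)
  protected
    FLine: string;
    procedure DoNotify; override;  // executed asynchronously in the main thread
  public
    class procedure AddLine(const ALine: string);
  end;

procedure TResultNotify.DoNotify;
begin
  Form1.ListBox2.Items.Add(FLine);
end;

class procedure TResultNotify.AddLine(const ALine: string);
begin
  with Create do
  begin
    FLine := ALine;
    Notify;  // queues DoNotify to the main thread and returns immediately,
             // so the worker thread is never blocked; the object frees itself
  end;
end;
```

A worker thread would simply call `TResultNotify.AddLine(ParsedMp3Url);` for each result it finds.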

垂暮老矣 · 2024-12-21 00:06:16


Indy and Synapse are both multi-thread ready. I'd recommend using Synapse, which is much lighter than Indy and will be sufficient for your purpose. Do not forget about the HTTP APIs provided by Microsoft.

Simple implementation:

  • One thread per URI;
  • Each thread gets the data using one HTTP communication;
  • Then each thread parses the data;
  • Then use Synchronize to refresh the UI.

Perhaps my favorite:

  • Define a maximum number of threads to be used (e.g. 8);
  • Each of these threads maintains a persistent remote connection (this is the point of HTTP/1.1 keep-alive, and it can really make a difference in speed);
  • All requests are retrieved by those threads one by one - do not pre-assign URLs to threads, but have a thread retrieve a new URL from a global list whenever it finishes one (each URL does not always take the same time);
  • The threads may wait until another URI is added to the global list (using Sleep(100) or a semaphore, for example);
  • Then parse and update the UI in the main GUI thread, using a dedicated window message (WM_USER+...) - parsing will be fast IMHO (and remember that UI refresh can be slow - take a look at the BeginUpdate/EndUpdate methods, for instance) - I found that posting a message (with the associated HTML data) is more efficient than using Synchronize, which blocks the background thread;
  • Another option is to do the parsing in the background thread, just after retrieving the data from its URI - perhaps not worth it (only if your parser is slow), and you may run into multi-threading issues if your parser/data processor is not 100% thread-safe.

The second approach is how popular so-called "download managers" are implemented.
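The message-based handoff to the main thread could be sketched like this; WM_HTML_READY, PostHtmlToUI, WMHtmlReady, and ParseHtml are all hypothetical names. The worker posts a heap-allocated copy of the HTML, and the main thread frees it after parsing:

```pascal
uses
  Windows, Messages, SysUtils, Forms;

const
  WM_HTML_READY = WM_USER + 1;  // hypothetical custom message id

type
  TForm1 = class(TForm)
  private
    procedure WMHtmlReady(var Msg: TMessage); message WM_HTML_READY;
  end;

// Worker thread side: post a heap copy of the HTML, then keep downloading.
procedure PostHtmlToUI(const Html: string);
begin
  PostMessage(Form1.Handle, WM_HTML_READY, 0, LPARAM(StrNew(PChar(Html))));
end;

// Main thread side: parse, update the UI, and free the copy.
procedure TForm1.WMHtmlReady(var Msg: TMessage);
var
  P: PChar;
begin
  P := PChar(Msg.LParam);
  try
    ParseHtml(P);  // hypothetical parser: extract mp3 links, add them to ListBox2
  finally
    StrDispose(P); // free the copy made by the worker thread
  end;
end;
```

Unlike Synchronize, PostMessage returns immediately, so the worker thread can start its next download while the main thread parses.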

When you deal with multithreading, you have to "protect" your shared resources (lists, for example). Use a TCriticalSection to access any global list (e.g. the URI list), and release the lock as soon as possible.
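A sketch of such a lock-protected global URL list; NextUrl is a hypothetical helper that each worker thread would call in a loop:

```pascal
uses
  Classes, SyncObjs;

var
  UrlList: TStringList;       // shared work queue, filled by the main thread
  UrlLock: TCriticalSection;  // guards every access to UrlList

// Pop the next URL to fetch, or return '' when the queue is empty.
function NextUrl: string;
begin
  UrlLock.Enter;
  try
    if UrlList.Count > 0 then
    begin
      Result := UrlList[0];
      UrlList.Delete(0);
    end
    else
      Result := '';
  finally
    UrlLock.Leave;  // release the lock as soon as possible
  end;
end;

// Inside each worker thread's Execute, roughly:
//   Url := NextUrl;
//   while Url <> '' do
//   begin
//     { download Url, hand the result to the main thread }
//     Url := NextUrl;
//   end;
```

Only the few lines that touch the list run under the lock; the downloads themselves happen outside it, so the threads do not serialize each other.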

And try to test your implementation with several computers and networks, concurrent access, and diverse operating systems. Debugging multi-threaded applications can be difficult, so the simpler the implementation, the better: that is why I recommend making the download part multi-threaded but letting the main thread process the data (which won't be huge, so it should be fast).
