如何连续 QueueUserWorkItems 但不立即将它们全部排队？

发布于 2024-08-04 12:45:35 字数 2113 浏览 9 评论 0原文

我正在为网站开发一个多线程抓取器，根据不同的问题，我决定将 ThreadPool 与 QueueUserWorkItem() 一起使用。

如何连续对工作项目进行排队而不一次性将它们全部排队？我需要排队> 300k 个项目（每个用户 ID 一个），如果我循环将它们全部排队，我将耗尽内存。

所以，我想要的是：

// 1 = startUserID, 300000 = endUserID, 25 = MaxThreads  
Scraper webScraper = new Scraper(1, 300000, 25); 

webScraper.Start();  
// return immediately while webScraper runs in the background

在此期间，当线程可用时，webScraper 会不断添加所有 300000 个工作项。

这是我到目前为止所得到的：

public class Scraper
    {
        private int MaxUserID { get; set; }
        private int MaxThreads { get; set; }
        private static int CurrentUserID { get; set; }
        private bool Running { get; set; }
        private Parser StatsParser = new Parser();


        public Scraper()
            : this(0, Int32.MaxValue, 25)
        {
        }

        public Scraper(int CurrentUserID, int MaxUserID, int MaxThreads)
        {
            this.CurrentUserID = CurrentUserID;
            this.MaxUserID = MaxUserID;
            this.MaxThreads = MaxThreads;
            this.Running = false;

            ThreadPool.SetMaxThreads(MaxThreads, MaxThreads);
        }

        public void Start()
        {
            int availableThreads;

            // Need to start a new thread to spawn the new WorkItems so Start() will return right away?
            while (Running)
            {

                // if (!CurrentUserID >= MaxUserID)
                // {
                //     while (availableThreads > 0)
                //     {
                //         ThreadPool.QueueUserWorkItem(new WaitCallBack(Process));
                //     }
                // }
                // else
                // { Running = false; }
            }
        }

        public void Stop()
        {
            Running = false;
        }

        public static void process(object state)
        {
             var userID = Interlocked.Increment(ref CurrentUserID);
             ... Fetch Stats for userID
        }
    }

这是正确的方法吗？

任何人都可以为我指明正确的方向，以便在调用 Start() 后在后台处理工作项的创建，而不是立即创建所有工作项吗？

原文

I'm working on a multi-threaded scraper for a website and as per a different question I've decided to use the ThreadPool with QueueUserWorkItem().

How can I continually Queue work items without queuing them all at once? I need to queue > 300k items (one for each userID) and if I loop to queue them all I'll run out of memory.

So, what I would like is:

// 1 = startUserID, 300000 = endUserID, 25 = MaxThreads  
Scraper webScraper = new Scraper(1, 300000, 25); 

webScraper.Start();  
// return immediately while webScraper runs in the background

During this time, webScraper is continuouslly adding all 300000 workItems as threads become available.

Here is what I have so far:

public class Scraper
    {
        private int MaxUserID { get; set; }
        private int MaxThreads { get; set; }
        private static int CurrentUserID { get; set; }
        private bool Running { get; set; }
        private Parser StatsParser = new Parser();


        public Scraper()
            : this(0, Int32.MaxValue, 25)
        {
        }

        public Scraper(int CurrentUserID, int MaxUserID, int MaxThreads)
        {
            this.CurrentUserID = CurrentUserID;
            this.MaxUserID = MaxUserID;
            this.MaxThreads = MaxThreads;
            this.Running = false;

            ThreadPool.SetMaxThreads(MaxThreads, MaxThreads);
        }

        public void Start()
        {
            int availableThreads;

            // Need to start a new thread to spawn the new WorkItems so Start() will return right away?
            while (Running)
            {

                // if (!CurrentUserID >= MaxUserID)
                // {
                //     while (availableThreads > 0)
                //     {
                //         ThreadPool.QueueUserWorkItem(new WaitCallBack(Process));
                //     }
                // }
                // else
                // { Running = false; }
            }
        }

        public void Stop()
        {
            Running = false;
        }

        public static void process(object state)
        {
             var userID = Interlocked.Increment(ref CurrentUserID);
             ... Fetch Stats for userID
        }
    }

Is this the right approach?

Can anyone point me in the right direction for handling the creation of my work items while in the background once Start() is called, and not creating all Work items at once?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

暗地喜欢 2024-08-11 12:45:35

如果从工作队列中窃取工作的工作项较少，是否可以更好地实现这一点？仅仅因为你有 300,000 件工作要做，并不意味着你需要 300,000 名工人来完成。显然，由于您只有几个核心，因此只有少数工作可以并行发生，那么为什么不将大量工作分配给更少的工作人员呢？

根据每项工作所花费的时间的恒定程度，您可以将其平均分配给每个工作人员，也可以拥有一个中央队列（您必须锁定该队列），并且每个工作人员都可以在工作用完时获取一些工作。

编辑：

Joe Duffy 似乎有一系列关于在这里编写工作窃取队列的文章： http://www.bluebytesoftware.com/blog/2008/08/12/BuildingACustomThreadPoolSeriesPart2AWorkStealingQueue.aspx。 .Net 4 的线程池看起来也会更加智能。但我认为对于这种情况你不需要特别复杂的东西。

回复收藏 0 原文

再见回来 2024-08-11 12:45:35

我认为创建一个排队项目队列似乎不太正确，那么在 WorkItems 完成后再次让它们自己排队怎么样？

您的 Start 方法可以排队，例如 3 倍 MaxThreads 项（在您的示例中为 75），然后您的 Process 方法在完成时自行排队。这样，您的 Start 方法会快速返回，但会触发许多工作项，正如我所说，然后它们会自行触发：


    public class Scraper
    {
        private int MaxUserID { get; set; }
        private int MaxThreads { get; set; }
        private int currentUserID;
        private bool Running { get; set; }
        private Parser StatsParser = new Parser();

        private int Multiplier { get; set; }

        public Scraper()
            : this(0, Int32.MaxValue, 25)
        {
        }

        public Scraper(int currentUserID, int maxUserID, int maxThreads)
        {
            this.currentUserID = currentUserID;
            this.MaxUserID = maxUserID;
            this.MaxThreads = maxThreads;
            this.Running = false;

            ThreadPool.SetMaxThreads(maxThreads, maxThreads);
            Multiplier = 3;
        }

        public void Start()
        {
            Running = true;
            for (int i = 0; i < MaxThreads * Multiplier; i++)
            {
                ThreadPool.QueueUserWorkItem(Process);
            }
        }

        public void Stop()
        {
            Running = false;
        }

        public void Process(object state)
        {
            if (Running == false)
            {
                return;
            }
            if (currentUserID < MaxUserID)
            {
                Interlocked.Increment(ref currentUserID);
                //Parse stats for currentUserID
                ThreadPool.QueueUserWorkItem(Process);
            }
            else
            { Running = false; }
        }
    }

我确信为了安全起见，应该使用 Interlocked 设置 Running 标志。我已将乘数放入一个属性中，该属性可以传递给构造函数 - 我相当确定可以对其进行调整以调整性能，具体取决于解析这些统计数据所需的时间。

I think creating a queue of queued items doesn't seem quite right somehow, so how about making the WorkItems queue themselves again after they've finished?

Your Start method could queue up, say, 3 times MaxThreads items (75 in your example) and then your Process method queues itself when it's finished. That way your Start method returns quickly but fires off a number of work items, which as I say then fire themselves:


    public class Scraper
    {
        private int MaxUserID { get; set; }
        private int MaxThreads { get; set; }
        private int currentUserID;
        private bool Running { get; set; }
        private Parser StatsParser = new Parser();

        private int Multiplier { get; set; }

        public Scraper()
            : this(0, Int32.MaxValue, 25)
        {
        }

        public Scraper(int currentUserID, int maxUserID, int maxThreads)
        {
            this.currentUserID = currentUserID;
            this.MaxUserID = maxUserID;
            this.MaxThreads = maxThreads;
            this.Running = false;

            ThreadPool.SetMaxThreads(maxThreads, maxThreads);
            Multiplier = 3;
        }

        public void Start()
        {
            Running = true;
            for (int i = 0; i < MaxThreads * Multiplier; i++)
            {
                ThreadPool.QueueUserWorkItem(Process);
            }
        }

        public void Stop()
        {
            Running = false;
        }

        public void Process(object state)
        {
            if (Running == false)
            {
                return;
            }
            if (currentUserID < MaxUserID)
            {
                Interlocked.Increment(ref currentUserID);
                //Parse stats for currentUserID
                ThreadPool.QueueUserWorkItem(Process);
            }
            else
            { Running = false; }
        }
    }

I'm sure the Running flag should be being set using Interlocked for safety. I've made the multiplier into a property, which could be passed to the constructor - I'm fairly sure it could be adjusted to tweak performance, depending on how long those stats take to parse.

回复收藏 0 原文

你的往事 2024-08-11 12:45:35

看起来您需要一个主流程控制类来管理正在启动的工作人员数量并保持队列已满。

然后，您可以使用两个队列：

一个用于保存需要抓取的所有项目，
第二个用于完成工作，

然后该主/总督对象将保持循环，直到队列 #1 中的所有项目都消失并且它将继续添加到队列中#2 当你有可用的周期时。

回复收藏 0 原文

暗喜 2024-08-11 12:45:35

我绝对不会使用 ThreadPool.SetMaxThreads - ~~记住线程池在所有进程之间共享~~ - 设置最大线程数只会降低性能。线程池背后的整个想法是，您不需要指定诸如最大线程数之类的东西 - .Net 框架会计算出要分配的最佳线程数 - 您不需要这样做。

请注意，排队 300 000 个项目不会导致产生 300 000 个线程 - ThreadPool 类将为您管理线程数并根据需要重新使用线程。如果您只是担心这种方式会消耗太多资源，我建议您改进流程 - 也许创建一个“Spawner”类来运行 1000 个抓取实例？

回复收藏 0 原文

童话里做英雄 2024-08-11 12:45:35

您可以使用不同的线程池。这是一个：http://www.codeplex.com/smartthreadpool
它允许您一次对所有物品进行排队。您可以指定要创建的最大线程数。假设您有 1000 个工作项并分配了 100 个线程。它将立即获取前 100 个项目并开始运行，而其余的则等待。一旦其中一个项目完成并且线程释放，下一个排队项目就会启动。它管理所有工作，但不会使线程和内存饱和。此外，它不使用 .net 线程池中的线程。

回复收藏 0 原文

~没有更多了~