使用 C# 多次比较 2 个巨大的列表(有一些变化)

发布于 2024-07-25 10:27:42 字数 1085 浏览 4 评论 0原文

大家好,这里的社区很棒。 我是一名电气工程师,兼职做一些“编程”工作以帮助支付账单。 我这样说是因为我希望您考虑到我没有接受过适当的计算机科学培训,但过去 7 年我一直在编码。

我有几个包含信息(全是数字)的 Excel 表格,基本上一列是“拨打的电话号码”,另一列是每个号码的分钟数。 另外,我有一个我所在国家/地区不同运营商的“运营商前缀代码号码”列表。 我想要做的是将每个运营商的所有“流量”分开。 场景如下:

首次拨打的号码行123456789ABCD,100 <-- 这将是一个 13 位数字的电话号码和 100 分钟。

我有一个包含运营商 1 的 12,000 多个前缀代码的列表,这些代码的长度各不相同,我需要检查其中的每个代码:

前缀代码 11234567 <- - 该代码有 7 位数字长。

我需要检查拨打号码的前 7 位数字,并将其与拨打的号码进行比较,如果发现匹配,我会将分钟数添加到小计中以供以后使用。 请考虑并非所有前缀代码的长度都相同,有时它们会更短或更长。

这大部分应该是小菜一碟,我应该能够这样做,但我对大量数据感到有点害怕; 有时,拨打的号码列表包含多达 30,000 个号码,而“运营商前缀代码”列出了大约 13,000 行长,而我通常会检查 3 个运营商,这意味着我必须进行大量“匹配”。

有谁知道如何使用 C# 有效地做到这一点? 或者说实话,任何其他语言。 我需要经常这样做,设计一个工具来做到这一点会更有意义。 我需要来自具有“计算机科学家”背景的人的良好观点。

列表不需要位于 Excel 工作表中,我可以导出到 csv 文件并从那里工作,我不需要“MS Office”界面。

感谢您的帮助。

更新:

感谢大家花时间回答我的问题。 我想由于我的无知,我过度夸大了“高效”这个词。 我不会每隔几秒钟执行一次此任务。 这是我每天必须做一次的事情,我讨厌使用 Excel 和 VLOOKUP 等。

我从你们那里学到了新概念,我希望我可以使用你们的想法构建一个解决方案。

Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.

I have several excel tables with information (all numeric), basically it is "dialed phone numbers" in one column and number of minutes to each of those numbers on another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:

First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.

I have a list of 12,000+ prefix codes for carrier 1, these codes vary in length, and I need to check everyone of them:

Prefix Code 1: 1234567 <-- this code is 7 digits long.

I need to check the first 7 digits for the dialed number an compare it to the dialed number, if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length, some times they are shorter or longer.

Most of this should be a piece of cake, and I could should be able to do it, but I'm getting kind of scared with the massive amount of data; Some times the dialed number lists consists of up to 30,000 numbers, and the "carrier prefix code" lists around 13,000 rows long, and I usually check 3 carriers, that means I have to do a lot of "matches".

Does anyone have an idea of how to do this efficiently using C#? Or any other language to be kind honest. I need to do this quite often and designing a tool to do it would make much more sense. I need a good perspective from someone that does have that "Computer Scientist" background.

Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.

Thanks for your help.

Update:

Thank you all for your time on answering my question. I guess in my ignorance I over exaggerated the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day and I hate to do with with Excel and VLOOKUPs, etc.

I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

菩提树下叶撕阳。 2024-08-01 10:27:43

在我看来,您需要从运营商前缀构建 trie 。 您最终将得到一个特里树,其中终止节点告诉您该前缀的运营商。

然后创建一个从载体到intlong(总计)的字典。

然后,对于每个拨打的号码行,只需沿着字典树向下查找,直到找到运营商。 查找运营商迄今为止的总分钟数,并添加当前行 - 然后继续。

It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.

Then create a dictionary from carrier to an int or long (the total).

Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.

娇柔作态 2024-08-01 10:27:43

能够相当有效地完成此操作的最简单的数据结构是集合列表。 为每个运营商创建一个集合以包含所有前缀。

现在,要将呼叫与运营商关联起来:

foreach (Carrier carrier in carriers)
{
    bool found = false;

    for (int length = 1; length <= 7; length++)
    {
        int prefix = ExtractDigits(callNumber, length);

        if (carrier.Prefixes.Contains(prefix))
        {
            carrier.Calls.Add(callNumber);
            found = true;
            break;
        }
    }

    if (found)
        break;
}

如果您有 10 个运营商,则每个呼叫的集合中将有 70 次查找。 但是集合中的查找并不太慢(比线性搜索快得多)。 因此,与强力线性搜索相比,这应该会给你带来相当大的加速。

您可以更进一步,根据长度对每个运营商的前缀进行分组。 这样,如果运营商只有长度为 7 和 4 的前缀,您就会知道只需提取并查找这些长度,每次查找该长度的前缀集。

The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.

Now, to associate a call with a carrier:

foreach (Carrier carrier in carriers)
{
    bool found = false;

    for (int length = 1; length <= 7; length++)
    {
        int prefix = ExtractDigits(callNumber, length);

        if (carrier.Prefixes.Contains(prefix))
        {
            carrier.Calls.Add(callNumber);
            found = true;
            break;
        }
    }

    if (found)
        break;
}

If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.

You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.

南街女流氓 2024-08-01 10:27:43

将数据转储到几个数据库表中,然后使用 SQL 查询它们怎么样? 简单的!

CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )

CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )

-- now populate the tables, create indexes etc
-- and then just run your query...

SELECT p.prefix,
    SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
    INNER JOIN dbo.prefixes AS p
        ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix

(这是为 SQL Server 编写的,但转换为任何其他 DBMS 应该非常简单。)

How about dumping your data into a couple of database tables and then query them using SQL? Easy!

CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )

CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )

-- now populate the tables, create indexes etc
-- and then just run your query...

SELECT p.prefix,
    SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
    INNER JOIN dbo.prefixes AS p
        ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix

(This was written for SQL Server, but should be very simple to translate for any other DBMS.)

清风疏影 2024-08-01 10:27:43

也许在数据库中而不是在 C# 中执行此操作会更简单(不一定更有效)。

您可以在数据库中插入行,并在插入时确定运营商并将其包含在记录中(可能在插入触发器中)。

那么你的报告将是表上的求和查询。

Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.

You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).

Then your report would be a sum query on the table.

世态炎凉 2024-08-01 10:27:43

我可能只是将条目放入列表中,对其进行排序,然后使用 二进制搜索 来寻找匹配项。 定制二分搜索匹配条件以返回第一个匹配的项目,然后沿列表迭代,直到找到不匹配的项目。 二分搜索只需大约 15 次比较即可搜索包含 30,000 个项目的列表。

I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.

梦里泪两行 2024-08-01 10:27:43

您可能需要使用 C# 中的哈希表

这样你就有了键值对,你的键可以是电话号码,你的值可以是总分钟数。 如果在密钥集中找到匹配项,则修改总分钟数,否则添加新密钥。

然后,您只需要修改搜索算法,不要查看整个密钥,而只查看它的前 7 位数字。

You may want to use a HashTable in C#.

This way you have key-value pairs, and your keys could be the phone numbers, and your value the total minutes. If a match is found in the key set, then modify the total minutes, else, add a new key.

You would then just need to modify your searching algorithm, to not look at the entire key, but only the first 7 digits of it.

记忆之渊 2024-08-01 10:27:42

更新

您可以使用一个简单的技巧 - 按前缀的第一个数字将前缀分组到字典中,然后仅将数字与正确的子集进行匹配。 我使用以下两个 LINQ 语句对其进行了测试,假设每个前缀至少有三个数字。

const Int32 minimumPrefixLength = 3;

var groupedPefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);

var numberPrefixes = numbers
    .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();

那么这有多快呢? 15.000 个前缀和 50.000 个数字花费的时间不到 250 毫秒。对于两行代码来说足够快吗?

请注意,性能在很大程度上取决于最小前缀长度 (MPL),因此取决于您可以构造的前缀组的数量。

     MPL    Runtime
    -----------------
     1     10.198 ms
     2      1.179 ms
     3        205 ms
     4        130 ms
     5        107 ms

只是为了提供一个粗略的想法 - 我只跑了一次,还有很多其他事情正在进行。

原始答案

我不太关心性能 - 一台普通的台式电脑可以轻松地处理具有 1 亿行的数据库表。 也许需要五分钟,但我假设您不想每隔一秒执行一次任务。

我刚刚做了一个测试。 我生成了一个包含 15.000 个 5 到 10 位数字的唯一前缀的列表。 从这个前缀中,我生成了 50.000 个带有前缀和附加 5 到 10 位数字的号码。

List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);

然后我使用以下 LINQ to Object 查询来查找每个数字的前缀。

var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();

好吧,在我的 2.0 GHz Core 2 Duo 笔记本电脑上大约需要一分钟。 因此,如果一分钟的处理时间是可以接受的,如果包括聚合,则可能是两分钟或三分钟,我不会尝试优化任何内容。 当然,如果程序能够在一两秒内完成任务,那就太好了,但这会增加相当多的复杂性,并且会导致很多错误。 设计、编写和测试需要时间。 LINQ 语句只花了我几秒钟的时间。

测试应用程序

请注意,生成许多前缀非常慢,可能需要一两分钟。

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            // Set number of prefixes and calls to not more than 50 to get results
            // printed to the console.

            Console.Write("Generating prefixes");
            List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
            Console.WriteLine();

            Console.Write("Generating calls");
            List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
            Console.WriteLine();

            Console.WriteLine("Processing started.");

            Stopwatch stopwatch = new Stopwatch();

            const Int32 minimumPrefixLength = 5;

            stopwatch.Start();

            var groupedPefixes = prefixes
                .GroupBy(p => p.Substring(0, minimumPrefixLength))
                .ToDictionary(g => g.Key, g => g);

            var result = calls
                .GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
                    .First(c.Number.StartsWith))
                .Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
                .ToList();

            stopwatch.Stop();

            Console.WriteLine("Processing finished.");
            Console.WriteLine(stopwatch.Elapsed);

            if ((prefixes.Count <= 50) && (calls.Count <= 50))
            {
                Console.WriteLine("Prefixes");
                foreach (String prefix in prefixes.OrderBy(p => p))
                {
                    Console.WriteLine(String.Format("  prefix={0}", prefix));
                }

                Console.WriteLine("Calls");
                foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
                {
                    Console.WriteLine(String.Format("  number={0} duration={1}", call.Number, call.Duration));
                }

                Console.WriteLine("Result");
                foreach (Call call in result.OrderBy(c => c.Number))
                {
                    Console.WriteLine(String.Format("  prefix={0} accumulated duration={1}", call.Number, call.Duration));
                }
            }

            Console.ReadLine();
        }

        private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<String> prefixes = new List<String>(count);
            StringBuilder stringBuilder = new StringBuilder(maximumLength);

            while (prefixes.Count < count)
            {
                stringBuilder.Length = 0;

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                String prefix = stringBuilder.ToString();

                if (prefixes.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
                {
                    prefixes.Add(stringBuilder.ToString());
                }
            }

            return prefixes;
        }

        private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<Call> calls = new List<Call>(count);
            StringBuilder stringBuilder = new StringBuilder();

            while (calls.Count < count)
            {
                stringBuilder.Length = 0;

                stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                if (calls.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
            }

            return calls;
        }

        private class Call
        {
            public Call (String number, Decimal duration)
            {
                this.Number = number;
                this.Duration = duration;
            }

            public String Number { get; private set; }
            public Decimal Duration { get; private set; }
        }
    }
}

UPDATE

You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements assuming every prefix has at least three digis.

const Int32 minimumPrefixLength = 3;

var groupedPefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);

var numberPrefixes = numbers
    .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();

So how fast is this? 15.000 prefixes and 50.000 numbers took less than 250 milliseconds. Fast enough for two lines of code?

Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.

     MPL    Runtime
    -----------------
     1     10.198 ms
     2      1.179 ms
     3        205 ms
     4        130 ms
     5        107 ms

Just to give an rough idea - I did just one run and have a lot of other stuff going on.

Original answer

I wouldn't care much about performance - an average desktop pc can quiete easily deal with database tables with 100 million rows. Maybe it takes five minutes but I assume you don't want to perform the task every other second.

I just made a test. I generated a list with 15.000 unique prefixes with 5 to 10 digits. From this prefixes I generated 50.000 numbers with a prefix and additional 5 to 10 digits.

List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);

Then I used the following LINQ to Object query to find the prefix of each number.

var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();

Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be realy nice if the programm could do the task in a second or two, but this will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took my only seconds.

Test application

Note that generating many prefixes is really slow and might take a minute or two.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            // Set number of prefixes and calls to not more than 50 to get results
            // printed to the console.

            Console.Write("Generating prefixes");
            List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
            Console.WriteLine();

            Console.Write("Generating calls");
            List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
            Console.WriteLine();

            Console.WriteLine("Processing started.");

            Stopwatch stopwatch = new Stopwatch();

            const Int32 minimumPrefixLength = 5;

            stopwatch.Start();

            var groupedPefixes = prefixes
                .GroupBy(p => p.Substring(0, minimumPrefixLength))
                .ToDictionary(g => g.Key, g => g);

            var result = calls
                .GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
                    .First(c.Number.StartsWith))
                .Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
                .ToList();

            stopwatch.Stop();

            Console.WriteLine("Processing finished.");
            Console.WriteLine(stopwatch.Elapsed);

            if ((prefixes.Count <= 50) && (calls.Count <= 50))
            {
                Console.WriteLine("Prefixes");
                foreach (String prefix in prefixes.OrderBy(p => p))
                {
                    Console.WriteLine(String.Format("  prefix={0}", prefix));
                }

                Console.WriteLine("Calls");
                foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
                {
                    Console.WriteLine(String.Format("  number={0} duration={1}", call.Number, call.Duration));
                }

                Console.WriteLine("Result");
                foreach (Call call in result.OrderBy(c => c.Number))
                {
                    Console.WriteLine(String.Format("  prefix={0} accumulated duration={1}", call.Number, call.Duration));
                }
            }

            Console.ReadLine();
        }

        private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<String> prefixes = new List<String>(count);
            StringBuilder stringBuilder = new StringBuilder(maximumLength);

            while (prefixes.Count < count)
            {
                stringBuilder.Length = 0;

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                String prefix = stringBuilder.ToString();

                if (prefixes.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
                {
                    prefixes.Add(stringBuilder.ToString());
                }
            }

            return prefixes;
        }

        private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<Call> calls = new List<Call>(count);
            StringBuilder stringBuilder = new StringBuilder();

            while (calls.Count < count)
            {
                stringBuilder.Length = 0;

                stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                if (calls.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
            }

            return calls;
        }

        private class Call
        {
            public Call (String number, Decimal duration)
            {
                this.Number = number;
                this.Duration = duration;
            }

            public String Number { get; private set; }
            public Decimal Duration { get; private set; }
        }
    }
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文