如何减少这个问题的执行时间(即更快的代码)

发布于 2024-09-17 10:28:16 字数 1511 浏览 6 评论 0原文

这个问题来自 Codechef.com [如果有人仍在解决这个问题,请在尝试之前不要进一步查看该帖子],虽然它运行正常,但我需要让它更快。我是 c,c++ 的初学者。 (我知道数组、字符串和指针,但不知道文件处理等)。那么有没有一种方法可以使这个程序运行得更快而不使其变得复杂(如果算法复杂,那就可以了)。 如果你提到你遵循哪本书,我也会接受复杂的编码:)。 我目前正在关注罗伯特·拉福尔。 程序如下:-

有 N 个数字 a[0],a[1]..a[N - 1]。最初都是 0。您必须执行两种类型的操作:

1) 将索引 A 和 B 之间的数字加 1。这由命令“0 A B”表示

2) 回答索引 A 和 B 之间有多少个数字可以整除3。这由命令“1 A B”表示。

输入:

第一行包含两个整数,N 和 Q。接下来的 Q 行的每一行的形式都是“0 A B”或“1 A B”,如上所述。

输出:

为“1 A B”形式的每个查询输出 1 行,其中包含相应查询所需的答案。

输入示例:

4 7
1 0 3
0 1 2
0 1 3
1 0 0
0 0 3
1 3 3
1 0 3

输出示例:

4
1
0
2

约束:

1 <= N <= 100000
1 <= Q <= 100000
0 <= A <= B <= N - 1

这是我的解决方案:-

#include<stdio.h>

int main()
{
    unsigned int n; //amount of numbers taken
    scanf("%u",&n);

    unsigned int arr[n],q,a,b,count=0,j,i;
    short int c;
    scanf("%u",&q);//cases taken  
    for(i=0;i<n;++i)
    {
      arr[i]=0;
    }    


    for(i=0;i<q;++i)
    {
      scanf("%d%u%u",&c,&a,&b);//here a,b are A and B respectively and c is either 
                               //o or 1

      if(c==0)
      {
        for(j=a;j<=b;++j)
          arr[j]++;
      }


      if(c==1)
      {
        for(j=a;j<=b;++j)
        {
          if(arr[j]%3==0)
            count++;
        }
       printf("%u\n",count);
      }
      count=0;
    }  

}

this question is from Codechef.com [if anyone is still solving this question dont look further into the post before trying yourself] and although it runs right, but i need to make it a lot faster.i am a beginner in c,c++.(i know stuff upto arrays,strings and pointers but not file handling etc).So is there a way to make this program run any faster without making it complex(its ok if its complex in algorithm).
i will accept complex coding too if u mention which book u followed where it is all given :) .
i currently am following Robert Lafore.
Here is the program:-

There are N numbers a[0],a[1]..a[N - 1]. Initally all are 0. You have to perform two types of operations :

1) Increase the numbers between indices A and B by 1. This is represented by the command "0 A B"

2) Answer how many numbers between indices A and B are divisible by 3. This is represented by the command "1 A B".

Input :

The first line contains two integers, N and Q. Each of the next Q lines are either of the form "0 A B" or "1 A B" as mentioned above.

Output :

Output 1 line for each of the queries of the form "1 A B" containing the required answer for the corresponding query.

Sample Input :

4 7
1 0 3
0 1 2
0 1 3
1 0 0
0 0 3
1 3 3
1 0 3

Sample Output :

4
1
0
2

Constraints :

1 <= N <= 100000
1 <= Q <= 100000
0 <= A <= B <= N - 1

HERE IS MY SOLUTION:-

#include<stdio.h>

int main()
{
    unsigned int n; //amount of numbers taken
    scanf("%u",&n);

    unsigned int arr[n],q,a,b,count=0,j,i;
    short int c;
    scanf("%u",&q);//cases taken  
    for(i=0;i<n;++i)
    {
      arr[i]=0;
    }    


    for(i=0;i<q;++i)
    {
      scanf("%d%u%u",&c,&a,&b);//here a,b are A and B respectively and c is either 
                               //o or 1

      if(c==0)
      {
        for(j=a;j<=b;++j)
          arr[j]++;
      }


      if(c==1)
      {
        for(j=a;j<=b;++j)
        {
          if(arr[j]%3==0)
            count++;
        }
       printf("%u\n",count);
      }
      count=0;
    }  

}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

旧故 2024-09-24 10:28:16

1) 将索引 A 和 B 之间的数字加 1。这由命令“0 A B”表示

2) 回答索引 A 和 B 之间有多少个数可以被 3 整除。这用命令“1 A B”表示。

最初数字为 0,因此可被 3 整除。加 1 使数字不可整除。下一个增量 - 数字仍然不可整除。第三次增量使数字再次可整除。

可以尝试的第一个优化是不要让数字增长到 2 以上:如果在增量期间数字从 2 变为 3,请将其设置回零。现在搜索范围变成与 0 的简单比较。(这样数组将包含而不是模 3 的数字。)

第二个优化是使用范围而不是普通数组,例如类似于 RLE:将所有具有相同整除性的相邻数字折叠到一个范围内。该数组将包含如下结构而不是数字:

struct extent {
   int start; /* 0 .. N-1; extent should have at least one number */
   int end;   /* 0 .. N   */
   int n;     /* 0, 1, 2; we are only interested in the result of % */
};

最初,该数组将包含覆盖所有数字{0, N, 0}的单个范围。在增量步骤期间,一个范围可能会被分割或与相邻的范围合并。这种表示会加快数字的计数,因为您不是逐一而是分块地遍历数组。 (如果所有范围仅包含一个元素,它仍然会降级为线性搜索。)


另一种方法可以是使用三个带有下标的集合来代替数组。 Set #0 将包含模 3 为 0 的所有数字的下标,set #1 - 1,set #2 - 2。由于在增量操作期间,我们需要进行搜索,而不是 std::set< /code> 最好使用例如 std::bitset 每一位都标记属于该集合的数字的索引。

注意:这样我们就根本不保留原始数字。我们隐式地只保留模 3 的结果。

在增量期间,我们需要找到索引属于哪个集合,例如集合 #n,并将索引移动到下一个(模 3)集合:将集合 n 中的位设置为 0,将集合 n + 1 (mod 3) 中的位设置为 1。现在,计算可被 3 整除的数字就像计算集合 #0 中的非零位一样简单。这可以通过创建一个临时 std::bitset 作为掩码来实现,其中 [A,B] 范围内的位设置为 1,并使用临时集掩码 # 0 并对结果位集调用 std::bitset::count()

1) Increase the numbers between indices A and B by 1. This is represented by the command "0 A B"

2) Answer how many numbers between indices A and B are divisible by 3. This is represented by the command "1 A B".

Initially numbers are 0 and thus are divisible by 3. Increment by one make the number non-divisible. Next increment - number still non-divisible. Third increment makes the number again divisible.

First optimization one can try is to not to let number grow above 2: if during increment number goes from 2 to 3, set it back to zero. Now search for the range become a simple comparison with 0. (That way array would contain instead of the number its modulo 3.)

Second optimization is to use extents instead of plain array, e.g. something similar to the RLE: collapse into a range all adjacent numbers with the same divisibility. Instead of numbers, the array would contain structures like that:

struct extent {
   int start; /* 0 .. N-1; extent should have at least one number */
   int end;   /* 0 .. N   */
   int n;     /* 0, 1, 2; we are only interested in the result of % */
};

Initially the array would contain single extent covering the all numbers {0, N, 0}. During increment step, a range might be split or joined with adjacent one. That representation would speed-up the counting of numbers as you would go over the array not one-by-one but in chunks. (It would still degrade to the linear search if all ranges are contain only one element.)


Another approach could be to use instead of the array three sets with indeces. Set #0 would contain all the indeces of numbers whose modulo 3 is 0, set #1 - 1, set #2 - 2. Since during increment operation, we need to do a search, instead of std::set it is better to use e.g. std::bitset with every bit marking the index of the number belonging to the set.

N.B. This way we do not keep the original numbers at all. We implicitly keep only the result of modulo 3.

During increment, we need to find which set the index belongs to, e.g. set #n, and move the index to the next (mod 3) set: set the bit to zero in the set n, set the bit to 1 in the set n + 1 (mod 3). Counting numbers divisible by 3 now ends up being as simple as counting the non-zero bits in the set #0. That can be accomplished by creating a temp std::bitset as a mask with bits in range [A,B] set to one, masking with the temp set the set #0 and calling std::bitset::count() on the resulting bitset.

会发光的星星闪亮亮i 2024-09-24 10:28:16

两个非常简单的优化:

您实际上只能在数组中存储模 3 的值,而不是实际值。

增量可以通过一个简单的查找表来完成(以避免比较和分支):

char increment[3] = { 1, 2 ,0 };
new_val = increment[old_val];

测试 3 整除性现在是与 0 进行比较 - 这比整数除法快得多。

Two very simple optimizations:

You could really only store the value modulo 3 in the array, and not the actual value.

The increment could be done by a simple lookup table (to avoid comparisons and branches):

char increment[3] = { 1, 2 ,0 };
new_val = increment[old_val];

Testing for 3-divisibility is now comparing to 0 - which is significantly faster than integer division.

梦在夏天 2024-09-24 10:28:16

您可以进行的一项改进是替换

if (c == 0) {
    //code here
}

if (c == 1) {
   // code here
}

为:

if (c == 0) {
   //...
} else if (c == 1) {
  //...
}

如果您确定 c 始终为 0 或 1,您还可以将 else if 替换为简单的 <代码>其他。

真正拖慢你速度的是 I/O。如果列表足够大,则可能需要 malloc 足够的内存来保存输入和输出。然后在进入循环之前收集所有输入并在最后显示输出。

One improvement that you can make is to replace

if (c == 0) {
    //code here
}

if (c == 1) {
   // code here
}

with:

if (c == 0) {
   //...
} else if (c == 1) {
  //...
}

if you know for sure that c will always be 0 or 1, you can also replace the else if with a simple else.

What's really slowing you down is the I/O. If it's a big enough list, it might pay to malloc enough memory to hold the input and output. Then collect all the input before you enter the loop and display the output at the end.

累赘 2024-09-24 10:28:16

据我所知,您的代码已经得到了前所未有的优化。

-亚历克斯

From what I can see, your code is as optimized as it will ever be.

-Alex

沙与沫 2024-09-24 10:28:16

我认为除了缺乏任何边界检查之外,您的解决方案似乎还不错。也许在检查“c”或开关时使用“else”,但这会节省您少量的时间。
我认为你不会在任何书中找到像这样无用的东西。

Your solution seems fine I think apart from lacking any bounds checking. Maybe use an 'else' when checking 'c' or a switch, but this will save you trivial amounts of time.
I don't think you'll find something as useless as this in any book.

清引 2024-09-24 10:28:16

这对我来说看起来非常高效。
我看到的一件事是您正在使用 可变长度数组,这对于C(据我所知),但在 C++ 中是非法的,对于 C++,您需要在数组上使用 std::vectornew

我能看到您提高性能的唯一地方是使用 Duff 的设备 进行部分循环展开我不建议将其用于玩具样品。

This looks pretty performant to me.
One thing I see is that you're using variable length arrays, this is OK for C (AFAIK) but illegal in C++, for C++ you will need to either use an std::vector or new up the array.

The only place I can see you improving your performance is by using Duff's Device for partial loop unrolling which I wouldn't recommend for a toy sample.

用心笑 2024-09-24 10:28:16

这非常复杂,但请跟我一起来。除了“27 年的编码经验”之外,我无法引用任何具体的地方。

原问题将数轴设置为自然整数 0,1,2,3,4,5,6...但是,我们只关心能被 3 整除的数字,所以让我们重新定义数轴,使其只有 3其中的值:{2,3,4}并重新映射数轴,使得:

0 => 4
1 => 2
2=> 3
3=> 4
4=> 2
5=> 3
6=> 4
.. 等等。

您会注意到,在我们的序列中,可被 3 整除的数字被映射到 4。为什么使用{2,3,4}? 4 在二进制中是 100,这意味着任何具有第 3 位设置的元素都可以被 3 整除。这很容易通过位运算进行测试。

由于我们使用 2,3,4 作为三进制序列,因此我们可以将数组元素大小减少到 4 位。我们将数组定义为 8 位值,但大小是我们需要的字节大小的一半(如果是奇数大小的数组则加 1),并将每个字节存储两个元素在数组中。添加和比较可以作为 SIMD(单指令、多数据)操作来完成,通过使用一些巧妙的位操作,每次循环迭代递增或检查最多 16 个元素。

这就是这个概念。现在来看代码。

首先,我们需要分配并初始化数组。

unsigned char *arr = malloc(n/2 + 1);

// Init all element values to 4:
memset(&arr, 0x44, n/2 + 1);

我们将通过将 8 个数组字节块转换为 uint_64 来一次递增 16 个元素,添加 0x1111111111111111,然后跳到下一个块。对 32 位、16 位、8 位和 4 位数学进行重复,操作结束时最多剩余 8、4、2 或 1。

在每次递增之前,任何值为 4 的值都需要在递增之前递减 3,以将数字保持在正确的位置。

以下是增量命令的代码(未经测试):

/**
   @param p
      p is the address of the byte with the first aligned element to be incremented, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
   @param j
      j is the number of elements to increment.  (B-A) when A is even, (B-A-1) when A is odd.
 */ 
void increment_aligned_block(unsigned char *p, int j)
    uint64_t fours;

    while (j>16) {
       // Find the ones that are value 4
       fours = *p & 0x4444444444444444;
       // Decrement each that matches by 3
       *p -= (fours >> 1 | fours >> 2);

       // Add 1 to each of the 16 array elements in the block.
       (uint64_t)(*p) += 0x1111111111111111;
       p += 8; j -= 16;
    }
    if (j >= 8) {
        // repeat the above for 32-bits (8 elements)
        // left as an exercise for the reader.
        p += 4; j -= 8;
   }
    if (j >= 4) {
        // repeat the above for 16-bits (4 elements)
        // left as an exercise for the reader.
        p += 2; j -= 4;
    }
    if (j >= 2) {
        // repeat the above for 8-bits (2 elements)
        // left as an exercise for the reader.
        p += 1; j -= 2;
    }
    if (j == 1) {
        // repeat the above for 8-bits (1 elements)
        // left as an exercise for the reader.
    }
}

对于比较使用:

/**
   @param p
      p is the address of the byte with the first aligned element to be counted, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
   @param j
      j is the number of elements to count.  (B-A) when A is even, (B-A-1) when A is odd.
 */ 
int count_aligned_block(unsigned char *p, int j)
    int count = 0;
    uint64_t divisible_map;

    while (j > 16) {
        // Find the values of 4 in the block
        divisible_map = (uint64_t)(*p) & 0x4444444444444444;

        // Count the number of 4s in the block,
        // 8-bits at a time
        while (divisible_map) {
          switch (divisible_map & 0x44) {
            case 0x04:
            case 0x40:
                count++;
                break;
            case 0x44:
                count += 2;
                break;
            default:
                break;
          }
          divisible_map >>= 8;
        }
    }
    // Repeat as above with 32, 16, 8 and 4-bit math.
    // Left as an exercise to the reader

    return count;
}

您可能已经注意到这些函数被称为 foo_aligned_block 并且 p 需要是第一个对齐的字节元素。那是什么?由于我们每个字节打包两个元素,因此起始元素索引必须与偶数对齐。如果文件中的命令是0 0 30,那么我们可以调用increment_algined_block(&arr[A/2], 30),没问题。但是,如果文件中的命令是 0 1 30,那么我们需要额外的代码来处理索引 1 处未对齐的第一个元素,然后调用 increment_aligned_block(&arr[A /2 + 1], 29)。再次,留给读者作为练习。


我想指出这并不是最理想的。

未对齐的访问通常非常昂贵。也就是说,从 8 字节对齐地址读取 8 字节值比从非对齐地址读取 8 字节值更快。我们可以添加额外的优化以仅调用 foo_aligned_block() ,以便保证所有访问都是对齐的。

This is pretty complicated, but stay with me. I can't cite any specific place I got this from except "27 years of coding experience."

The original problem has the number line set to natural integers 0,1,2,3,4,5,6... However, we only care about numbers that are divisible by 3, so let's redefine our number line to have only three values in it: {2,3,4} and remap the number line such that:

0 => 4
1 => 2
2 => 3
3 => 4
4 => 2
5 => 3
6 => 4
.. and so on.

You'll notice that numbers that are divisible by 3 are mapped to 4 in our sequence. Why use {2,3,4}? 4 is 100 in binary, which means any element with its 3rd bit set will be divisible by 3. This is easy to test with bit operations.

Since we're using 2,3,4 as the trinary sequence, we can reduce our array element size to 4-bits. We'll define the array as 8-bit values, but half the size as we need in bytes (plus 1 in case it is an odd-sized array), and store the elements two per byte in the array. The adds and compares can be done as SIMD (single instruction, multiple data) operations, incrementing or checking up to 16 elements per loop iteration by using some clever bit operations.

That's the concept. Now down to the code.

First, we need to allocate and initialize our array.

unsigned char *arr = malloc(n/2 + 1);

// Init all element values to 4:
memset(&arr, 0x44, n/2 + 1);

We're going to increment 16 elements at a time by casting a block of 8 array bytes to a uint_64, add 0x1111111111111111 and then skip to the next block. Repeat with 32-bit, 16-bit, 8-bit and 4-bit math for up to 8, 4, 2 or 1 remaining at the end of the operation.

Before each increment, anything that has a value of 4 needs to be decremented by 3 before the increment to keep the numbers in the proper position.

Here's the code (untested) for the increment command:

/**
   @param p
      p is the address of the byte with the first aligned element to be incremented, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
   @param j
      j is the number of elements to increment.  (B-A) when A is even, (B-A-1) when A is odd.
 */ 
void increment_aligned_block(unsigned char *p, int j)
    uint64_t fours;

    while (j>16) {
       // Find the ones that are value 4
       fours = *p & 0x4444444444444444;
       // Decrement each that matches by 3
       *p -= (fours >> 1 | fours >> 2);

       // Add 1 to each of the 16 array elements in the block.
       (uint64_t)(*p) += 0x1111111111111111;
       p += 8; j -= 16;
    }
    if (j >= 8) {
        // repeat the above for 32-bits (8 elements)
        // left as an exercise for the reader.
        p += 4; j -= 8;
   }
    if (j >= 4) {
        // repeat the above for 16-bits (4 elements)
        // left as an exercise for the reader.
        p += 2; j -= 4;
    }
    if (j >= 2) {
        // repeat the above for 8-bits (2 elements)
        // left as an exercise for the reader.
        p += 1; j -= 2;
    }
    if (j == 1) {
        // repeat the above for 8-bits (1 elements)
        // left as an exercise for the reader.
    }
}

For the compare use:

/**
   @param p
      p is the address of the byte with the first aligned element to be counted, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
   @param j
      j is the number of elements to count.  (B-A) when A is even, (B-A-1) when A is odd.
 */ 
int count_aligned_block(unsigned char *p, int j)
    int count = 0;
    uint64_t divisible_map;

    while (j > 16) {
        // Find the values of 4 in the block
        divisible_map = (uint64_t)(*p) & 0x4444444444444444;

        // Count the number of 4s in the block,
        // 8-bits at a time
        while (divisible_map) {
          switch (divisible_map & 0x44) {
            case 0x04:
            case 0x40:
                count++;
                break;
            case 0x44:
                count += 2;
                break;
            default:
                break;
          }
          divisible_map >>= 8;
        }
    }
    // Repeat as above with 32, 16, 8 and 4-bit math.
    // Left as an exercise to the reader

    return count;
}

You might have noticed the functions are called foo_aligned_block and p needs to be the byte of the first aligned element. What is that? Since we're packing two elements per byte, the starting element index must be aligned to an even number. If the command in the file is 0 0 30, then we can call increment_algined_block(&arr[A/2], 30), no problem. However, if the command in the file is 0 1 30, then we need to have additional code to handle the unaligned first element at index 1, and then call increment_aligned_block(&arr[A/2 + 1], 29). Again, left as an exercise to the reader.


I'd like to note this is not the most optimal it can be.

Unaligned accesses are usually pretty expensive. That is, reading an 8-byte value from an 8-byte aligned address is faster than from a non-aligned address. We can add additional optimizations to only call foo_aligned_block() such that all accesses are guaranteed to be aligned.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文