Arm-v8 PMCCNTR_EL0 returns 0 if read several times without unloading the kernel object

Posted on 2025-01-10 18:58:17

I have a CPU with multiple A72 cores.

I am trying to benchmark an algorithm, and I want to count the number of core cycles that elapse during the execution of a thread.

I have cross-compiled two kernel modules that configure the registers needed to access PMCCNTR_EL0 from user space:
https://github.com/rdolbeau/enable_arm_pmu

https://github.com/jerinjacobk/armv8_pmu_cycle_counter_el0

Both should do the same thing, so I only load one at a time. I compiled both because I have not yet found a solution that works reliably.

Here is the code I am using as a test (it is only an example, meant to exercise the register read).

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h> 
#include "armpmu_lib.h"

uint64_t tmp = 35000;
uint64_t t0_start = 0;
uint64_t t0_stop = 0;
uint64_t t1_start = 0;
uint64_t t1_stop = 0;
uint64_t t2_start = 0;
uint64_t t2_stop = 0;

void * thread_1(){
    //Set core affinity and priority
    cpu_set_t my_set;
    CPU_ZERO(&my_set);
    CPU_SET(1,&my_set);
    sched_setaffinity(0,sizeof(cpu_set_t),&my_set);
    struct sched_param param= {
        .sched_priority=99
    };
    sched_setscheduler(0,SCHED_FIFO,&param);
    sleep(1);
    //Bench
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(t1_start));
    for(int i=0; i<4000; i++){
        tmp+=1;
        //printf("Thread 1\n");
    }
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(t1_stop));
    return NULL;
}

void * thread_2(){
    //Set core affinity and priority
    cpu_set_t my_set;
    CPU_ZERO(&my_set);
    CPU_SET(8,&my_set);
    sched_setaffinity(0,sizeof(cpu_set_t),&my_set);
    struct sched_param param= {
        .sched_priority=0
    };
    sched_setscheduler(0,SCHED_FIFO,&param);
    //Bench
    sleep(1);
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(t2_start));
    for(int i=0; i<4000; i++){
        //printf("Thread 2\n");
        tmp+=5;
    }
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(t2_stop));
    return NULL;
}

int main(){
    //Get the starting point cycle number
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(t0_start));

    //Creates threads
    pthread_t thread_id_1;
    pthread_t thread_id_2;
    pthread_create(&thread_id_1, NULL, thread_1, NULL);
    pthread_create(&thread_id_2, NULL, thread_2, NULL);

    //Wait termination
    pthread_join(thread_id_1, NULL);
    pthread_join(thread_id_2, NULL);
    
    //Read number of cycles at the end of execution
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(t0_stop));
    
    printf("T0 Execution cycles : %lu\n",t0_stop - t0_start); //Main thread number of cycles
    printf("T1 Execution cycles : %lu\n",t1_stop - t1_start); //Thread 1 number of cycles
    printf("T2 Execution cycles : %lu\n",t2_stop - t2_start); //Thread 2 number of cycles
        
    return 0;
}
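
A note on the setup code above: it does not check the return values of sched_setaffinity and sched_setscheduler, so a thread may silently run on a different core or with a different policy than intended. On Linux, SCHED_FIFO only accepts priorities from 1 to 99, so the call in thread_2 with sched_priority=0 fails with EINVAL, and pinning to CPU 8 fails if that CPU does not exist or is offline. Below is a minimal sketch of checked versions of both calls. The helper names pin_to_cpu and set_fifo_priority are hypothetical, and this is not presented as the fix for the zero readings.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pin the calling thread to a single CPU.
   Returns 0 on success, -1 on failure with errno set. */
int pin_to_cpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        int saved = errno;   /* fprintf may overwrite errno */
        fprintf(stderr, "sched_setaffinity(cpu=%d): %s\n", cpu, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}

/* Hypothetical helper: switch the calling thread to SCHED_FIFO.
   Valid priorities on Linux are 1..99; this usually needs root
   or CAP_SYS_NICE. Returns 0 on success, -1 on failure. */
int set_fifo_priority(int prio)
{
    struct sched_param param = { .sched_priority = prio };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        int saved = errno;
        fprintf(stderr, "sched_setscheduler(prio=%d): %s\n", prio, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

Calling pin_to_cpu(1) and set_fifo_priority(99) at the top of thread_1, and printing sched_getcpu() next to each counter read, would show whether each thread really ran on the core it was meant to.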

With the enable_arm_pmu kernel module:

If the module is not loaded, I get an illegal instruction error, which is expected.
When I run the test code provided in the repository, it works correctly (I get consistent non-zero values).
If the module is loaded and I run my code once, I get extreme values (FFFFFFFFFFDDA4A0 or 0) for the main thread, and values that seem correct for the other threads (between 10 and 25 µs).
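
One way to read the value FFFFFFFFFFDDA4A0 reported for the main thread: t0_stop - t0_start is computed in unsigned 64-bit arithmetic, so a stop value smaller than the start value wraps around instead of going negative. Reinterpreted as a signed number, FFFFFFFFFFDDA4A0 is -2251616, which would mean the second read returned a smaller count than the first. Why that happens is not established here. A plausible cause (an assumption, not a diagnosis) is that the two reads ran on different cores, each with its own counter, since the main thread is not pinned. The helper names below are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: difference of two counter reads as a signed value,
   so that stop < start shows up as a negative number. The unsigned-to-signed
   conversion is implementation-defined in C; GCC and Clang wrap modulo 2^64. */
int64_t signed_cycle_delta(uint64_t start, uint64_t stop)
{
    return (int64_t)(stop - start);
}

/* Print one delta both ways: raw unsigned hex and signed decimal. */
void print_cycle_delta(const char *label, uint64_t start, uint64_t stop)
{
    printf("%s: raw=%016" PRIX64 " signed=%" PRId64 "\n",
           label, stop - start, signed_cycle_delta(start, stop));
}
```

For example, print_cycle_delta("T0", 0x225B60, 0) prints raw=FFFFFFFFFFDDA4A0 signed=-2251616, which matches the extreme value seen above.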

However, if I run my benchmark several times without unloading and reloading the kernel module, all subsequent runs measure 0 cycles for Thread 1 and Thread 2.

Am I missing something in the register configuration?

With the armv8_pmu_cycle_counter_el0 kernel module, the cycle count for the main thread seems correct (5 to 10 ms), but both threads report 0 cycles.
