How to calculate the probability distribution of an array using scipy

Posted on 2025-01-09 09:24:52

My actual goal is to calculate the difference between two histograms. For this I would like to use the Kullback–Leibler divergence. In the thread Calculating KL Divergence in Python it was said that SciPy's entropy function will calculate the KL divergence. For this I need a probability distribution for each of my datasets. I tried to follow the answers and instructions given in these two threads: How do I calculate PDF (probability density function) in Python? and How to effectively compute the pdf of a given dataset. Unfortunately I always get an error.

Here you can see my code, in which I split the data into 3 parts (training, validation and test dataset) and aim to calculate the pairwise differences between the data distributions of those 3 sets.

import scipy
from scipy.stats import norm
from scipy.stats import rv_histogram
import numpy as np
import pandas as pd



#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')

#Build training, validation and test data set      
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

data_histogram = df[['temperatures']].values
data_train_histogram = data_histogram [:timeslot_x_train_end]
data_valid_histogram = data_histogram [timeslot_x_train_end:timeslot_x_valid_end]
data_test_histogram = data_histogram [timeslot_x_valid_end:]

#Make histogram out of numpy array
histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto'))
histogram_valid = rv_histogram(np.histogram(data_valid_histogram, bins='auto'))
histogram_test = rv_histogram(np.histogram(data_test_histogram, bins='auto'))

#Make probability distribution out of the histogram
pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())
pdfs_valid = norm.cdf(histogram_valid, histogram_valid.mean(), histogram_valid.std())
pdfs_test = norm.cdf(histogram_test, histogram_valid.mean(), histogram_valid.std())

#Calculate the entropy between the different datasets
entropy_train_valid = scipy.special.rel_entr(pdfs_train, pdfs_valid)   
entropy_train_test = scipy.special.rel_entr(pdfs_train, pdfs_test) 
entropy_valid_test = scipy.special.rel_entr(pdfs_valid, pdfs_test) 

#Calculate the Kullback–Leibler divergence between the different datasets
kl_div_train_valid = np.sum(entropy_train_valid)
kl_div_train_test = np.sum(entropy_train_test)
kl_div_valid_test = np.sum(entropy_valid_test)

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
print(f"Kullback–Leibler divergence between training and test dataset: {kl_div_train_test}")
print(f"Kullback–Leibler divergence between validation and test dataset: {kl_div_valid_test}")

In this setup I get the error message "TypeError: unsupported operand type(s) for -: 'rv_histogram' and 'float'" thrown by the line pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std()). Here you can see the test dataset TestData. Do you have an idea why I get this error and how I can calculate the probability distribution from the arrays (and eventually the Kullback–Leibler divergence)?

Reminder: Any comments? I'll appreciate every comment.

七色彩虹 2025-01-16 09:24:52

The error comes from passing the rv_histogram object itself to norm.cdf, which expects numbers as its first argument. A histogram of a sample approximates the pdf of the underlying distribution, provided the sample is large enough to capture the distribution of the population. So histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto')) already gives you a distribution object defined by the histogram. It has a .pdf method to evaluate the pdf and a .rvs method to draw random values that follow this distribution. So to calculate the Kullback–Leibler divergence between two distributions you can do the following:

import numpy as np
import pandas as pd
from scipy.special import rel_entr
from scipy.stats import rv_histogram

#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')

#Build training, validation
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

#1-D array of the temperature values
data = df['temperatures'].to_numpy()
data_train = data[:timeslot_x_train_end]
data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]

#Make distribution objects of the histograms
histogram_dist_train = rv_histogram(np.histogram(data_train, bins='auto'))
histogram_dist_valid = rv_histogram(np.histogram(data_valid, bins='auto'))

#Evaluate both pdfs on the SAME grid: rel_entr compares its inputs element by element
x_min = min(np.min(data_train), np.min(data_valid))
x_max = max(np.max(data_train), np.max(data_valid))
X = np.linspace(x_min, x_max, 1000)
pdf_train = histogram_dist_train.pdf(X)
pdf_valid = histogram_dist_valid.pdf(X)

#Calculate the Kullback–Leibler divergence between the different datasets
#(Riemann sum of p*log(p/q) over the grid, hence the grid spacing factor)
#Note: the result is inf wherever pdf_valid is 0 but pdf_train is not
entropy_train_valid = rel_entr(pdf_train, pdf_valid)
kl_div_train_valid = np.sum(entropy_train_valid) * (X[1] - X[0])

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
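Since the question mentions SciPy's entropy function, here is a minimal sketch of a discrete variant: both samples are binned on one shared set of bin edges and the counts are passed to scipy.stats.entropy, which normalizes them and returns the KL divergence. The helper name histogram_kl, the bin count, the smoothing constant eps and the synthetic temperature samples are assumptions made for illustration only; they do not come from the original data.

```python
import numpy as np
from scipy.stats import entropy


def histogram_kl(sample_p, sample_q, bins=30, eps=1e-10):
    """Approximate KL(P || Q) from two samples using shared histogram bins.

    Hypothetical helper for illustration; eps is an assumed smoothing
    constant that keeps empty bins from producing an infinite result.
    """
    sample_p = np.ravel(sample_p)
    sample_q = np.ravel(sample_q)
    # One set of bin edges spanning both samples, so bins line up one to one
    edges = np.histogram_bin_edges(np.concatenate([sample_p, sample_q]), bins=bins)
    counts_p, _ = np.histogram(sample_p, bins=edges)
    counts_q, _ = np.histogram(sample_q, bins=edges)
    # entropy(p, q) normalizes both inputs and returns sum(p * log(p / q))
    return entropy(counts_p + eps, counts_q + eps)


# Synthetic stand-ins for the training and validation temperatures
rng = np.random.default_rng(0)
train = rng.normal(20.0, 5.0, size=5000)
valid = rng.normal(21.0, 5.0, size=2000)

kl_same = histogram_kl(train, train)
kl_diff = histogram_kl(train, valid)
print(f"KL(train || train): {kl_same}")
print(f"KL(train || valid): {kl_diff}")
```

The smoothing constant slightly biases the result, and the value depends on the number of bins, so it is best used to compare dataset pairs under the same binning rather than read as an absolute number.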

On the other hand, if you assume that the data are normally distributed, you can fit a normal distribution to each set instead:

#np and df are the ones defined in the previous block
from scipy.special import rel_entr
from scipy.stats import norm

#Build training, validation
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

#1-D array of the temperature values
data = df['temperatures'].to_numpy()
data_train = data[:timeslot_x_train_end]
data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]

#Make normal distribution objects from data mean and standard deviation
norm_dist_train = norm(data_train.mean(), data_train.std())
norm_dist_valid = norm(data_valid.mean(), data_valid.std())

#Evaluate both pdfs on the SAME grid: rel_entr compares its inputs element by element
x_min = min(np.min(data_train), np.min(data_valid))
x_max = max(np.max(data_train), np.max(data_valid))
X = np.linspace(x_min, x_max, 1000)
pdf_train = norm_dist_train.pdf(X)
pdf_valid = norm_dist_valid.pdf(X)

#Calculate the Kullback–Leibler divergence between the different datasets
#(Riemann sum over the data range; the tails outside the range are ignored)
entropy_train_valid = rel_entr(pdf_train, pdf_valid)
kl_div_train_valid = np.sum(entropy_train_valid) * (X[1] - X[0])

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
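Under the normality assumption the divergence also has a closed form, KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) = log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 sigma2^2) - 1/2, which can serve as a sanity check for a numerical sum. The sketch below compares the two; the function name kl_normal_closed_form and the parameter values are made up for illustration and would be replaced by the means and standard deviations of the actual datasets.

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import norm


def kl_normal_closed_form(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats (hypothetical helper)."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)


# Assumed example parameters, not taken from the original dataset
mu1, sigma1 = 20.0, 5.0
mu2, sigma2 = 21.0, 6.0

# Numerical estimate: evaluate both pdfs on one wide grid and
# approximate the integral of p * log(p / q) with a Riemann sum
x = np.linspace(-40.0, 80.0, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, mu1, sigma1)
q = norm.pdf(x, mu2, sigma2)
kl_numeric = np.sum(rel_entr(p, q)) * dx

kl_exact = kl_normal_closed_form(mu1, sigma1, mu2, sigma2)
print(f"numeric: {kl_numeric}, closed form: {kl_exact}")
```

If the two numbers disagree noticeably, the grid is probably too coarse or too narrow to cover the tails of both distributions.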
