How to calculate the probability distribution of an array with scipy
My actual goal is to calculate the difference between two histograms. For this I would like to use the Kullback–Leibler divergence. In the thread "Calculating KL Divergence in Python" it was said that SciPy's entropy function will calculate the KL divergence. For this I need a probability distribution of my datasets. I tried to follow the answers and instructions given in the two threads "How do I calculate PDF (probability density function) in Python?" and "How to effectively compute the pdf of a given dataset". Unfortunately, I always get an error.
Here you can see my code, in which I split the data into 3 parts (training, validation and test datasets) and aim to calculate the pairwise differences between the data distributions of those 3 sets.
import scipy
from scipy.stats import norm
from scipy.stats import rv_histogram
import numpy as np
import pandas as pd
#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')
#Build training, validation and test data set
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)
data_histogram = df[['temperatures']].values
data_train_histogram = data_histogram [:timeslot_x_train_end]
data_valid_histogram = data_histogram [timeslot_x_train_end:timeslot_x_valid_end]
data_test_histogram = data_histogram [timeslot_x_valid_end:]
#Make histogram out of numpy array
histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto'))
histogram_valid = rv_histogram(np.histogram(data_valid_histogram, bins='auto'))
histogram_test = rv_histogram(np.histogram(data_test_histogram, bins='auto'))
#Make probability distribution out of the histogram
pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())
pdfs_valid = norm.cdf(histogram_valid, histogram_valid.mean(), histogram_valid.std())
pdfs_test = norm.cdf(histogram_test, histogram_valid.mean(), histogram_valid.std())
#Calculate the entropy between the different datasets
entropy_train_valid = scipy.special.rel_entr(pdfs_train, pdfs_valid)
entropy_train_test = scipy.special.rel_entr(pdfs_train, pdfs_test)
entropy_valid_test = scipy.special.rel_entr(pdfs_valid, pdfs_test)
#Calculate the Kullback–Leibler divergence between the different datasets
kl_div_train_valid = np.sum(entropy_train_valid)
kl_div_train_test = np.sum(entropy_train_test)
kl_div_valid_test = np.sum(entropy_valid_test)
#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
print(f"Kullback–Leibler divergence between training and test dataset: {kl_div_train_test}")
print(f"Kullback–Leibler divergence between validation and test dataset: {kl_div_valid_test}")
In this setup I get the error message "TypeError: unsupported operand type(s) for -: 'rv_histogram' and 'float'", thrown by the line pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std()). Here you can see the test dataset TestData. Do you have an idea why I get this error and how I can calculate the probability distribution from the arrays (and eventually the Kullback–Leibler divergence)?
Reminder: Any comments? I'll appreciate every comment.
Comments (1)
A histogram of a sample can approximate the pdf of the underlying distribution, assuming the sample is large enough to capture the distribution of the population. So when you use
histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto'))
it generates a distribution given by a histogram. The resulting object has a .pdf method to evaluate the pdf and a .rvs method to generate values that follow this distribution. So to calculate the Kullback–Leibler divergence between two distributions, you can evaluate the .pdf of each histogram distribution at the same points and pass the resulting arrays to scipy.special.rel_entr. On the other hand, if you assume that the data have a normal distribution, you must instead evaluate norm.pdf at those points, using the mean and standard deviation of each dataset.
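The code that accompanied this answer is not preserved here, so what follows is only a minimal sketch of the two approaches it describes, not the answerer's original code. The TypeError in the question arises because norm.cdf expects numeric values as its first argument, whereas the question passes an rv_histogram object. Note also that a KL divergence compares densities or probabilities, not cumulative values. The helper names (kl_from_histograms, kl_from_normal_fit), the 200-point evaluation grid, the small eps used to avoid division by zero, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import norm, rv_histogram


def _common_grid(sample_p, sample_q, n_points):
    """Evenly spaced points spanning the range of both samples."""
    lo = min(sample_p.min(), sample_q.min())
    hi = max(sample_p.max(), sample_q.max())
    return np.linspace(lo, hi, n_points)


def _kl_on_grid(p, q, eps):
    """Discrete KL(P || Q) from two non-negative arrays on the same grid."""
    # eps is an assumption of this sketch: it keeps the result finite
    # where q is zero, at the cost of making the value depend on eps.
    p = p + eps
    q = q + eps
    p = p / p.sum()  # normalise so each array sums to 1
    q = q / q.sum()
    return float(np.sum(rel_entr(p, q)))


def kl_from_histograms(sample_p, sample_q, n_points=200, eps=1e-12):
    """Approximate KL(P || Q) using histogram-based pdfs (hypothetical helper)."""
    sample_p, sample_q = np.ravel(sample_p), np.ravel(sample_q)
    dist_p = rv_histogram(np.histogram(sample_p, bins="auto"))
    dist_q = rv_histogram(np.histogram(sample_q, bins="auto"))
    x = _common_grid(sample_p, sample_q, n_points)
    return _kl_on_grid(dist_p.pdf(x), dist_q.pdf(x), eps)


def kl_from_normal_fit(sample_p, sample_q, n_points=200, eps=1e-12):
    """Approximate KL(P || Q) assuming both samples are normal (hypothetical helper)."""
    sample_p, sample_q = np.ravel(sample_p), np.ravel(sample_q)
    x = _common_grid(sample_p, sample_q, n_points)
    p = norm.pdf(x, sample_p.mean(), sample_p.std())
    q = norm.pdf(x, sample_q.mean(), sample_q.std())
    return _kl_on_grid(p, q, eps)


# Synthetic stand-ins for the training and validation temperatures.
rng = np.random.default_rng(0)
train = rng.normal(20.0, 5.0, size=700)
valid = rng.normal(22.0, 6.0, size=200)

print("histogram-based KL(train || valid):", kl_from_histograms(train, valid))
print("normal-fit KL(train || valid):", kl_from_normal_fit(train, valid))
```

Both densities are evaluated on the same grid because rel_entr compares its two arrays element by element; histograms built with bins='auto' on different datasets generally have different bin edges, so their raw bin counts are not directly comparable. The KL divergence is not symmetric, so swapping the arguments generally gives a different value.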