Trouble developing a KNeighborsClassifier analysis in Python
I'm trying to produce a routine using KNeighborsClassifier in Python in Jupyter. My goal is to group the diversity values into 4 types of water masses, but when I test my code, a ''Dead kernel'' message appears on my Jupyter page.

I want to produce a figure similar to this, only adapting it to my data:

[figure: example KNN decision-region plot]

This is the code I'm working on:
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
from sklearn import neighbors

index = pd.read_excel('diverty_index.xlsx')  # This is my data set
X = index[['Shannon', 'Depth']]
y = index['Water_mass']

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot_water_knn(X, y, n_neighbors, weights):
    X_mat = X[['Shannon', 'Depth']].values  # Shannon is a diversity index
    y_mat = y.values

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])

    clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y_mat,
                cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='AASW')
    patch1 = mpatches.Patch(color='#00FF00', label='CDW')
    patch2 = mpatches.Patch(color='#0000FF', label='MWDW')
    patch3 = mpatches.Patch(color='#AFAFAF', label='AABW')
    plt.legend(handles=[patch0, patch1, patch2, patch3])

    plt.xlabel('Shannon H')
    plt.ylabel('Profundidade(m)')
    plt.show()

plot_water_knn(X_train, y_train, 5, 'uniform')
```
1 Answer
I think the dead kernel issue is a result of using a fine mesh (`mesh_step_size`) on a large feature space. Standardizing your data will help, and should improve the model. If it doesn't, your entire dataset might be too large for your machine. But the first-order problem is that this code is a bit jumbled up, mixing modeling and plotting in a sort of half-written function. Let's simplify everything and start with the classifier. Forget the plotting for now.
Making a classifier
I refactored a bit, using conventional names for things. Try this:
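Something along these lines (a minimal sketch reusing your file and column names; the `test_size` and `random_state` values are arbitrary choices, not anything from your setup):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel('diverty_index.xlsx')

X = df[['Shannon', 'Depth']].values  # features: diversity index and depth
y = df['Water_mass'].values          # target: one of the 4 water masses

# Hold out a validation set; random_state makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```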
Note here that it's important that your samples are not in 'clumps', e.g. location A with 5 measurements at various depths, then location B with 4 measurements at different depths, etc. If they are 'clumped' like this, you can't just split randomly with `train_test_split`; instead you need to split the clumps themselves (e.g. the locations; scikit-learn's group-aware splitters like `GroupShuffleSplit` do exactly this).

Now you need to scale your data. This classifier depends on distances, and distances don't mean much if your features are on different scales. So do this:
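For example, with scikit-learn's `StandardScaler`; note that the scaler is fit on the training data only and then reused on the validation data:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std from training data
X_val = scaler.transform(X_val)          # apply the same transform to validation data
```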
Now you can fit the classifier:
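For instance (`n_neighbors=5` is just a starting point; `classification_report` prints per-class precision and recall along with the weighted F1):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

clf = KNeighborsClassifier(n_neighbors=5, weights='uniform')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))  # includes the weighted-average F1
```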
How does this look? What's the weighted F1 score? Does it improve if you change the value of `n_neighbors`? How good can you make this model? (Strictly speaking you should have another blind dataset to test all this, but that's a detail.) If you got this far with an intact kernel, then you can feel good about having a somewhat useful KNN model, and you can move on to the data viz.
Decision region visualization
I suspect that this line was killing your kernel: `mesh_step_size = .01`. If the features of `X` have a large range (e.g. 0 to 10,000), the mesh will be gigantic, eating your memory and crashing the kernel. But now that we've standardized the data, things are more predictable, because most values will be in the range -3 to +3. This minimalist approach, adapted from this famous plot, should produce something:
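A sketch of that approach, assuming the scaled arrays and the fitted `clf` from above; one wrinkle is that `pcolormesh` needs numeric values, so string `Water_mass` labels are mapped to integer codes first:

```python
import numpy as np
import matplotlib.pyplot as plt

# Map the class labels to integer codes so the mesh can be colored numerically.
classes, y_train_codes = np.unique(y_train, return_inverse=True)
clf.fit(X_train, y_train_codes)  # refit on the integer codes

# Standardized features live roughly in [-3, 3], so a fixed coarse grid is cheap.
h = 0.05
xx, yy = np.meshgrid(np.arange(-3, 3, h), np.arange(-3, 3, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap='Pastel1', shading='auto')  # decision regions
sc = plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train_codes,
                 cmap='Set1', edgecolor='black')           # training points
plt.legend(sc.legend_elements()[0], list(classes))         # one entry per water mass
plt.xlabel('Shannon H (standardized)')
plt.ylabel('Depth (standardized)')
plt.show()
```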
If this at least produces something, then you're off to the races. I'm sure you can add a title, axis annotations, etc., and make it pretty. Good luck.