Switching between workspaces using databricks-connect

Asked 2025-01-16 19:13:14

Is it possible to switch workspace with the use of databricks-connect?

I'm currently trying to switch with: spark.conf.set('spark.driver.host', cluster_config['host'])

But this returns the following error:
AnalysisException: Cannot modify the value of a Spark config: spark.driver.host

3 Answers

瑾夏年华 2025-01-23 19:13:14

If you look into the documentation on setting up the client, you will see that there are three methods to configure Databricks Connect:

  • A configuration file generated with databricks-connect configure - the file name is always ~/.databricks-connect
  • Environment variables - DATABRICKS_ADDRESS, DATABRICKS_API_TOKEN, ...
  • Spark configuration properties - spark.databricks.service.address, spark.databricks.service.token, ... But when using this method, the Spark session may already be initialized, so you may not be able to switch on the fly without restarting Spark (see the sketch after this list).
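
For the third method, here is a minimal sketch of what setting those properties at session creation can look like (assuming the classic databricks-connect client; the host, token, and cluster ID values are placeholders, and the spark.databricks.service.clusterId name is an assumption alongside the address/token properties above):

from pyspark.sql import SparkSession

# Connection properties are read when the session is created; they
# cannot be changed on a live session, which is the same restriction
# the spark.conf.set call in the question runs into.
spark = SparkSession.builder \
    .config('spark.databricks.service.address', 'https://<workspace-url>') \
    .config('spark.databricks.service.token', '<personal-access-token>') \
    .config('spark.databricks.service.clusterId', '<cluster-id>') \
    .getOrCreate()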

But if you use different DBR versions, then it's not enough to change configuration properties; you also need to switch between Python environments that contain the corresponding version of the Databricks Connect distribution.

For my own work I wrote the following Zsh script, which makes it easy to switch between different setups (shards), although it allows using only one shard at a time. The prerequisites are:

  • A Python environment is created with the name <name>-shard
  • databricks-connect is installed into the activated environment (managed here with pyenv) with:
pyenv activate field-eng-shard
pip install -U databricks-connect==<DBR-version>
  • databricks-connect is configured once per shard, and the configuration for a specific cluster/shard is stored in a ~/.databricks-connect-<name> file that will be symlinked to ~/.databricks-connect
function use-shard() {
    SHARD_NAME="$1"
    if [ -z "$SHARD_NAME" ]; then
        echo "Usage: use-shard shard-name"
        return 1
    fi
    # Refuse to touch a real file (not a symlink) - it may belong to a
    # shard that was configured manually
    if [ ! -L ~/.databricks-connect ] && [ -f ~/.databricks-connect ]; then
        echo "There is ~/.databricks-connect file - possibly you configured another shard"
    elif [ -f ~/.databricks-connect-${SHARD_NAME} ]; then
        # Point ~/.databricks-connect at the per-shard config and
        # activate the matching Python environment
        rm -f ~/.databricks-connect
        ln -s ~/.databricks-connect-${SHARD_NAME} ~/.databricks-connect
        pyenv deactivate
        pyenv activate ${SHARD_NAME}-shard
    else
        echo "There is no configuration file for shard: ~/.databricks-connect-${SHARD_NAME}"
    fi
}
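
For example, once ~/.databricks-connect-field-eng exists, running use-shard field-eng symlinks that file to ~/.databricks-connect and activates the field-eng-shard environment; calling it again with another shard name switches both the configuration and the Python environment in one step.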
魔法唧唧 2025-01-23 19:13:14

I created a simple Python script to change the cluster_id within the .databricks-connect configuration file.

To execute it, ensure your virtual env has the environment variable DATABRICKS_CLUSTER configured. How to obtain the cluster ID is shown in the official databricks-connect documentation.

Set the environment variable with:

export DATABRICKS_CLUSTER=your-cluster-id

Once the environment variable is set, simply use the following Python script to switch clusters whenever a new virtual environment is activated.

import os
import json

# Get the Databricks cluster associated with the current virtual env
DATABRICKS_CLUSTER = os.getenv('DATABRICKS_CLUSTER')
HOME = os.getenv('HOME')

# Fail early rather than writing a null cluster_id into the config
if not DATABRICKS_CLUSTER:
    raise SystemExit('DATABRICKS_CLUSTER is not set in this environment')

# Open the databricks-connect config file
with open(f'{HOME}/.databricks-connect', 'r') as j:
    config = json.loads(j.read())

# Update the cluster ID
config['cluster_id'] = DATABRICKS_CLUSTER

# Save the databricks-connect config file
with open(f'{HOME}/.databricks-connect', 'w') as outfile:
    json.dump(config, outfile, indent=4)
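
Since the question asks about switching whole workspaces rather than just clusters, the same idea can be extended to rewrite the host and token too. Below is a minimal sketch, assuming your .databricks-connect file uses the standard host/token/cluster_id keys; the DATABRICKS_HOST and DATABRICKS_TOKEN variable names are this sketch's own convention, mirroring DATABRICKS_CLUSTER above:

import os
import json

HOME = os.getenv('HOME')

# Hypothetical per-environment variables, set alongside
# DATABRICKS_CLUSTER in each virtual env's activation script
overrides = {
    'host': os.getenv('DATABRICKS_HOST'),
    'token': os.getenv('DATABRICKS_TOKEN'),
    'cluster_id': os.getenv('DATABRICKS_CLUSTER'),
}

with open(f'{HOME}/.databricks-connect', 'r') as j:
    config = json.loads(j.read())

# Only overwrite the keys that are actually set in this environment
config.update({k: v for k, v in overrides.items() if v})

with open(f'{HOME}/.databricks-connect', 'w') as outfile:
    json.dump(config, outfile, indent=4)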
别闹i 2025-01-23 19:13:14

Probably it doesn't answer your question directly, but it's also possible to use the Databricks plugin for Visual Studio Code, which uses databricks-connect; from there it is very easy to switch between different environments: https://marketplace.visualstudio.com/items?itemName=paiqo.databricks-vscode

        "databricks.connectionManager": "VSCode Settings",
        "databricks.connections": [
            {
                "apiRootUrl": "https://westeurope.azuredatabricks.net",
                "displayName": "My DEV workspace",
                "localSyncFolder": "c:\\Databricks\\dev",
                "personalAccessToken": "dapi219e30212312311c6721a66ce879e"
            },
            {
                "apiRootUrl": "https://westeurope.azuredatabricks.net",
                "displayName": "My TEST workspace",
                "localSyncFolder": "c:\\Databricks\\test",
                "personalAccessToken": "dapi219e30212312311c672aaaaaaaaaa"
            }
        ],
        ...