Google Dataproc job submission with local keytab / ticket cache files

Posted 2025-02-11 09:12:21

I am trying to submit a Dataproc job that will consume data from a Kerberized Kafka cluster.
The current working solution is to have the JAAS config file and the keytab on the machine that issues the dataproc jobs submit command:

gcloud dataproc jobs submit pyspark \
    --cluster MY-CLUSTER --region us-west1 --project MY_PROJECT \
    --files my_keytab_file.keytab,my_jaas_file.conf \
    --properties spark.driver.extraJavaOptions=-Djava.security.auth.login.config=my_jaas_file.conf,spark.executor.extraJavaOptions=-Djava.security.auth.login.config=my_jaas_file.conf \
    gs://CODE_BUCKET/path/to/python/main.py 

The contents of my_jaas_file.conf:

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useKeyTab=true
  serviceName="kafka"
  keyTab="my_keytab_file.keytab"
  principal="[email protected]";
};

Consumer code:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MY_APP") \
    .master("yarn") \
    .getOrCreate()

# batch read from Kafka; assumes the spark-sql-kafka connector is on the classpath
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "BOOTSTRAP_SERVERS_LIST[broker:port,broker:port,broker:port]") \
    .option("kafka.sasl.mechanism", "GSSAPI") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.group.id", "PREDEFINED_CG") \
    .option("subscribe", "MY_TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load() 

df.show()

The files get copied to GCS, and from there, I think, they are copied into the YARN workspace. The JVM is able to pick them up, and the authentication succeeds.

However, this setup is not feasible, because I will not have access to the keytab file. The keytab will be part of a deployment process and will be available on the master and worker nodes at a location on disk. A service will pick up the keytab file and maintain a cache file, which will become the source for Kerberized Kafka authentication.
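
For reference, such a refresh service could be as small as a scheduled kinit. A minimal sketch, assuming MIT Kerberos client tools on every node (the hourly schedule is an assumption; the paths and principal are the placeholders used throughout this question):

# crontab entry (sketch): rebuild the ticket cache from the deployed keytab
# every hour; -kt names the keytab to read, -c names the cache file to write
0 * * * * kinit -kt /path/to/keytab/my_keytab_file.keytab -c /path/to/keytab/krb5_ccache [email protected]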

I have tried creating a JAAS config file on the master and on each worker node:

nano /path/to/keytab/my_jaas_file.config
# variant 1
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useKeyTab=true
  serviceName="kafka"
  keyTab="/path/to/keytab/my_keytab_file.keytab"
  principal="[email protected]";
};
# variant 2
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useTicketCache=true
  ticketCache="/path/to/keytab/krb5_ccache"
  serviceName="kafka"
  principal="[email protected]";
};

and submit the Dataproc job with the following configuration:

gcloud dataproc jobs submit pyspark \
    --cluster MY-CLUSTER --region us-west1 --project MY_PROJECT \
    --properties spark.driver.extraJavaOptions=-Djava.security.auth.login.config=file:///path/to/keytab/my_jaas_file.config,spark.executor.extraJavaOptions=-Djava.security.auth.login.config=file:///path/to/keytab/my_jaas_file.config \
    gs://CODE_BUCKET/path/to/python/main.py 

The JAAS configuration file is correctly picked up and read from disk by the Spark process; I verified this by intentionally deleting it from one node, after which the job failed with a "File not found" error.
The keytab file and the ticket cache file, however, are not picked up, and the following error is generated:

org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner  authentication information from the user
javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner  authentication information from the user
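
One way to narrow this down is to check, on each node, that the ticket cache is valid and that the files are readable by the user the Spark processes actually run under. A rough sketch ("yarn" as the container user is an assumption; the real user can be read from ps on a worker):

# inspect ownership and permissions of the Kerberos files
ls -l /path/to/keytab/
# confirm the cache holds a valid ticket when read as the container user
sudo -u yarn klist -c /path/to/keytab/krb5_ccache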

After digging through the Krb5LoginModule documentation, it seems that this is the default behavior:

When multiple mechanisms to retrieve a ticket or key is provided, the
preference order is:

  1. ticket cache
  2. keytab
  3. shared state
  4. user prompt

For variant 1, the sequence appears to be (a tracing sketch follows the list):

  1. The JVM picks up the settings from the JAAS file stored on local disk on each master / worker node (the file:// reference works)
  2. Krb5LoginModule searches for the keytab at /path/to/keytab/my_keytab_file.keytab
  3. It does not find it, so it checks whether a ticket cache is available
  4. No ticket cache is available, so it falls back to the shared state
  5. No login information is defined in the shared state
  6. It asks for a username and password, which is not possible in the current context (a PySpark job)
  7. It throws: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
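
To watch this sequence in the JVM itself, the JDK-wide Kerberos trace can be enabled next to the JAAS setting. A sketch: -Dsun.security.krb5.debug=true is a standard JDK flag, and because the property value now contains a space, the command uses gcloud's alternate-delimiter syntax (see gcloud topic escaping) with '#' instead of ',' between properties:

gcloud dataproc jobs submit pyspark \
    --cluster MY-CLUSTER --region us-west1 --project MY_PROJECT \
    --properties '^#^spark.driver.extraJavaOptions=-Djava.security.auth.login.config=file:///path/to/keytab/my_jaas_file.config -Dsun.security.krb5.debug=true#spark.executor.extraJavaOptions=-Djava.security.auth.login.config=file:///path/to/keytab/my_jaas_file.config -Dsun.security.krb5.debug=true' \
    gs://CODE_BUCKET/path/to/python/main.py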

I have tried multiple ways of defining the keytab / ccache file in the JAAS config:

keyTab="file:/path/to/keytab/my_keytab_file.keytab"
keyTab="file:///path/to/keytab/my_keytab_file.keytab"
keyTab="local:/path/to/keytab/my_keytab_file.keytab"

But none of them seems to pick up the needed keytab file.

There are a lot of things Spark and Dataproc do behind the scenes.

Answer by 北渚 (2025-02-18 09:12:21):

Managed to solve it!

It seems that the ccache file and the keytab file were not accessible to any other user:

# make the Kerberos files readable by the non-root users on the executors
sudo chmod 744 /path/to/keytab/my_jaas_file.config
sudo chmod 744 /path/to/keytab/krb5_ccache
sudo chmod 744 /path/to/keytab/my_keytab_file.keytab
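
Since 744 leaves the files world-readable, a tighter variant would scope read access to the group the executors run under (a sketch; the yarn group is an assumption):

# grant read access to the yarn group only, instead of to everyone
sudo chgrp yarn /path/to/keytab/my_keytab_file.keytab /path/to/keytab/krb5_ccache
sudo chmod 640 /path/to/keytab/my_keytab_file.keytab /path/to/keytab/krb5_ccache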

The job runs as root on the driver, but the executors are not run as root; they probably use the yarn or hadoop user.
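
For what it's worth, one way to confirm which OS user the executors actually run as (a sketch; the process name match assumes Spark-on-YARN executor JVMs):

# run on a worker node while the job is active; prints the owning user of
# each executor JVM ([c] stops grep from matching its own command line)
ps -eo user,args | grep -i [c]oarsegrainedexecutorbackend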

Hope this helps other wandering souls!
