Can't query a Hive table from PySpark: the error shows the call going to the wrong IP
sol.spark.sql("select * from type_match")
2022-04-19 10:31:33 WARN FileStreamSink:66 - Error while looking for metadata directory.
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\pyspark\sql\session.py", line 710, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o22.sql.
: java.util.concurrent.ExecutionException: org.apache.hadoop.net.ConnectTimeoutException: Call From A191136324/10.58.0.0 to 10.58.0.1:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=10.58.245.43/10.58.245.43:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.spark_project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.spark_project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:137)
at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:227)
at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:264)
at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:255)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
Since I moved to another workplace, my IP address has changed, but my Hive setup still hasn't picked this up: the traceback shows the call going from my new IP address to the old NameNode IP. Hive itself is running fine and I can query tables from the Hive CLI, but when I query a table from PySpark it gets stuck for a while and then fails, telling me it is calling the wrong IP. Are there any settings I should modify?

PS: I already changed the DBS and SDS tables in the MySQL metastore; I can access the data from Hive, but I still cannot query it from Spark.

Thanks,
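For context, the stale address in the traceback (10.58.245.43:9000) is an HDFS NameNode endpoint, which the client usually resolves through `fs.defaultFS`. A hypothetical `core-site.xml` fragment (the hostname is a placeholder, not from the original post) that points the client at the current NameNode would look like:

```
<!-- $HADOOP_CONF_DIR/core-site.xml — hostname below is a placeholder -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

Note that table locations recorded in the metastore (the `DBS` and `SDS` tables mentioned above) can still embed the old address even after the client configuration is fixed.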
I hope you have copied hive-site.xml into the Spark conf folder?
Why it's needed: without it, SparkSQL uses its own default metastore (Derby), so it has no information about the Hive metastore.
So you have to copy hive-site.xml into the Spark conf folder ($SPARK_HOME/conf).
You also need to register the table before executing the SQL. The code below worked for me when accessing a table from Hive using PySpark.