Can't query a Hive table from PySpark: the error shows the call going to the wrong IP
sol.spark.sql("select * from type_match")
2022-04-19 10:31:33 WARN FileStreamSink:66 - Error while looking for metadata directory.
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\pyspark\sql\session.py", line 710, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\z00635559\PycharmProjects\pythonProject2\venv\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o22.sql.
: java.util.concurrent.ExecutionException: org.apache.hadoop.net.ConnectTimeoutException: Call From A191136324/10.58.0.0 to 10.58.0.1:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=10.58.245.43/10.58.245.43:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.spark_project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.spark_project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:137)
at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:227)
at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:264)
at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:255)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
Since I moved to another workplace, my IP address has changed, but my Hive setup still hasn't picked this up: the traceback shows the call going from my new IP address to the old NameNode IP. Hive itself is running fine and I can query tables from the Hive CLI, but when I query a table from PySpark it gets stuck for a while and then fails, telling me it is calling the wrong IP. Are there any settings I should modify?

PS: I already changed the DBS and SDS tables in the MySQL metastore; I can access the data from Hive, but I still cannot query it from Spark.

Thanks,
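For context, the stale address in the traceback (10.58.245.43:9000) is an HDFS NameNode endpoint, which the client usually resolves through `fs.defaultFS`. A hypothetical `core-site.xml` fragment (the hostname is a placeholder, not from the original post) that points the client at the current NameNode would look like:

```
<!-- $HADOOP_CONF_DIR/core-site.xml — hostname below is a placeholder -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

Note that table locations recorded in the metastore (the `DBS` and `SDS` tables mentioned above) can still embed the old address even after the client configuration is fixed.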
I hope you have copied hive-site.xml into the Spark conf folder?
Why it's needed: without it, SparkSQL uses its own default metastore (Derby), so it has no information about the Hive metastore.
So you have to copy hive-site.xml into the Spark conf folder ($SPARK_HOME/conf).
You also need to register the table before executing the SQL. The code below worked for me when accessing a table from Hive using PySpark.