df.count() doesn't work for me, what should I do?

I'm working on a sentiment analysis project using PySpark. After preprocessing the data, I use TextBlob to score the sentiment of each tweet, then convert the result back to a DataFrame like this (a rough sketch of the TextBlob step itself follows the sample output):

# Convert RDD Back to DataFrame
File_new_df = sqlContext.createDataFrame(File_rdd_new)
File_new_df.show(5)

+--------------------+------------------+--------------------+---------+
|          tweet_text|      Subjectivity|            Polarity|Sentiment|
+--------------------+------------------+--------------------+---------+
|           tweettext|               0.0|                 0.0|  Neutral|
|woman faces lashe...|             0.625|               0.125| Positive|
|           worldcup |               0.0|                 0.0|  Neutral|
|going expose leak...|0.6386363636363637|-0.03257575757575757| Negative|
|qatar whose autho...|0.8333333333333334|                 0.5| Positive|
+--------------------+------------------+--------------------+---------+
only showing top 5 rows
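
(For reference, the TextBlob scoring is applied row-wise over the RDD before the conversion above, roughly as in the sketch below. The rowwise_function and getSubjectivity names come from the traceback further down; File_df, getSentiment, and the exact Row fields are assumptions.)

from pyspark.sql import Row
from textblob import TextBlob

def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    return TextBlob(text).sentiment.polarity

def getSentiment(polarity):
    # Bucket the polarity score into the three labels seen above
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    return "Neutral"

def rowwise_function(row):
    # Attach the TextBlob scores to each tweet row
    polarity = getPolarity(row.tweet_text)
    return Row(tweet_text=row.tweet_text,
               Subjectivity=getSubjectivity(row.tweet_text),
               Polarity=polarity,
               Sentiment=getSentiment(polarity))

File_rdd_new = File_df.rdd.map(lambda row: rowwise_function(row))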

Now I want to run the following (Sentiment is the column that holds Positive, Negative, or Neutral):

File_new_df.groupBy("Sentiment").count().show(3) 

df.count() doesn't work. What can I do?

However, when I call the .count() method on the DataFrame, it throws the error below:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-44-59bd5bd510b2> in <module>
----> 1 File_new_df.groupBy("Sentiment").count().show(3)

C:\spark\spark\python\pyspark\sql\dataframe.py in show(self, n, truncate, vertical)
    482         """
    483         if isinstance(truncate, bool) and truncate:
--> 484             print(self._jdf.showString(n, 20, vertical))
    485         else:
    486             print(self._jdf.showString(n, int(truncate), vertical))

C:\spark\spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

C:\spark\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

C:\spark\spark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o705.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 1 times, most recent failure: Lost task 0.0 in stage 81.0 (TID 504) (DESKTOP-95B8MQL executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-32-244ea6f47285>", line 21, in <lambda>
  File "<ipython-input-32-244ea6f47285>", line 8, in rowwise_function
  File "<ipython-input-31-d55b51a92547>", line 2, in getSubjectivity
  File "C:\ProgramData\Anaconda3\lib\site-packages\textblob\blob.py", line 384, in __init__
    raise TypeError('The `text` argument passed to `__init__(text)` '
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <class 'NoneType'>
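
I suspect some tweet_text values are null, so getSubjectivity ends up handing None to TextBlob once the job actually runs (the map is lazy, so the failure only shows up when show() or count() triggers it). Would guarding the helpers like this be the right fix (just a sketch; the neutral 0.0 defaults are placeholders I picked)?

from textblob import TextBlob

def getSubjectivity(text):
    # Fall back to a neutral score instead of handing None to TextBlob
    if text is None:
        return 0.0
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    if text is None:
        return 0.0
    return TextBlob(text).sentiment.polarity

Or is it better to drop the null rows up front, e.g. File_df.filter(File_df.tweet_text.isNotNull()) on the source DataFrame, before applying the map?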
