df.count() doesn't work for me. What should I do?

Posted on 2025-01-17 21:50:34

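I'm working on a sentiment analysis project with PySpark. After preprocessing the data, I use TextBlob to determine the sentiment of each tweet. I get the result and convert it back to a DataFrame like this: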

# Convert RDD Back to DataFrame
File_new_df = sqlContext.createDataFrame(File_rdd_new)
File_new_df.show(5)

+--------------------+------------------+--------------------+---------+
|          tweet_text|      Subjectivity|            Polarity|Sentiment|
+--------------------+------------------+--------------------+---------+
|           tweettext|               0.0|                 0.0|  Neutral|
|woman faces lashe...|             0.625|               0.125| Positive|
|           worldcup |               0.0|                 0.0|  Neutral|
|going expose leak...|0.6386363636363637|-0.03257575757575757| Negative|
|qatar whose autho...|0.8333333333333334|                 0.5| Positive|
+--------------------+------------------+--------------------+---------+
only showing top 5 rows

Now I want to count the rows for each sentiment (Sentiment is the column that holds Positive, Negative, or Neutral):

File_new_df.groupBy("Sentiment").count().show(3)

df.count() doesn't work. What can I do?

When I call the .count() method on the DataFrame, it throws the following error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-44-59bd5bd510b2> in <module>
----> 1 File_new_df.groupBy("Sentiment").count().show(3)

C:\spark\spark\python\pyspark\sql\dataframe.py in show(self, n, truncate, vertical)
    482         """
    483         if isinstance(truncate, bool) and truncate:
--> 484             print(self._jdf.showString(n, 20, vertical))
    485         else:
    486             print(self._jdf.showString(n, int(truncate), vertical))

C:\spark\spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

C:\spark\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

C:\spark\spark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o705.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 1 times, most recent failure: Lost task 0.0 in stage 81.0 (TID 504) (DESKTOP-95B8MQL executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "C:\spark\spark\python\lib\pyspark.zip\pyspark\util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-32-244ea6f47285>", line 21, in <lambda>
  File "<ipython-input-32-244ea6f47285>", line 8, in rowwise_function
  File "<ipython-input-31-d55b51a92547>", line 2, in getSubjectivity
  File "C:\ProgramData\Anaconda3\lib\site-packages\textblob\blob.py", line 384, in __init__
    raise TypeError('The `text` argument passed to `__init__(text)` '
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <class 'NoneType'>
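
Reading the last frames of the traceback, the failure comes from getSubjectivity passing None to TextBlob, so at least one row apparently has a missing tweet_text. Spark evaluates lazily: show(5) only has to compute the first few rows, while groupBy("Sentiment").count() has to run the row-wise function over every row, which would explain why the error only appears now. Below is a minimal sketch of a None-safe helper. The body of getSubjectivity is not shown in the post, so this version (and the companion getPolarity) is an assumption about what it roughly does, not the original code.

```python
def getSubjectivity(text):
    """Return TextBlob's subjectivity score, or None when the text is missing."""
    if not isinstance(text, str):
        # None (or any other non-string) would make TextBlob raise TypeError.
        return None
    # Imported lazily so the guard above can be checked without TextBlob installed.
    from textblob import TextBlob
    return TextBlob(text).sentiment.subjectivity


def getPolarity(text):
    """Return TextBlob's polarity score, or None when the text is missing.

    Hypothetical companion helper; the post does not show a polarity function.
    """
    if not isinstance(text, str):
        return None
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


if __name__ == "__main__":
    # A missing tweet now yields None instead of raising TypeError.
    print(getSubjectivity(None))
```

An alternative is to drop the rows with a null tweet_text before the RDD step, for example with df.na.drop(subset=["tweet_text"]), where df stands for whatever DataFrame precedes File_rdd_new (that name is not given in the post). Which option fits depends on whether tweets without text should appear in the counts at all.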

