df.count() doesn't work for me, what should I do?
I am working on a sentiment analysis project using PySpark. After preprocessing the data, I use TextBlob to compute the sentiment of each tweet, and then I convert the result to a DataFrame like this:
# Convert RDD Back to DataFrame
File_new_df = sqlContext.createDataFrame(File_rdd_new)
File_new_df.show(5)
+--------------------+------------------+--------------------+---------+
|          tweet_text|      Subjectivity|            Polarity|Sentiment|
+--------------------+------------------+--------------------+---------+
|           tweettext|               0.0|                 0.0|  Neutral|
|woman faces lashe...|             0.625|               0.125| Positive|
|           worldcup |               0.0|                 0.0|  Neutral|
|going expose leak...|0.6386363636363637|-0.03257575757575757| Negative|
|qatar whose autho...|0.8333333333333334|                 0.5| Positive|
+--------------------+------------------+--------------------+---------+
only showing top 5 rows
Now I want to count the rows per sentiment (Sentiment is the column that holds Positive, Negative, or Neutral):
File_new_df.groupBy("Sentiment").count().show(3)
df.count() doesn't work. How can I do this?
When I call the .count() method on the DataFrame, it throws the error below:
Py4JJavaError Traceback (most recent call last)
<ipython-input-44-59bd5bd510b2> in <module>
----> 1 File_new_df.groupBy("Sentiment").count().show(3)
C:\spark\spark\python\pyspark\sql\dataframe.py in show(self, n, truncate, vertical)
482 """
483 if isinstance(truncate, bool) and truncate:
--> 484 print(self._jdf.showString(n, 20, vertical))
485 else:
486 print(self._jdf.showString(n, int(truncate), vertical))
C:\spark\spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
C:\spark\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
C:\spark\spark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o705.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 1 times, most recent failure: Lost task 0.0 in stage 81.0 (TID 504) (DESKTOP-95B8MQL executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\spark\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
File "C:\spark\spark\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
File "C:\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\spark\python\lib\pyspark.zip\pyspark\util.py", line 73, in wrapper
return f(*args, **kwargs)
File "<ipython-input-32-244ea6f47285>", line 21, in <lambda>
File "<ipython-input-32-244ea6f47285>", line 8, in rowwise_function
File "<ipython-input-31-d55b51a92547>", line 2, in getSubjectivity
File "C:\ProgramData\Anaconda3\lib\site-packages\textblob\blob.py", line 384, in __init__
raise TypeError('The `text` argument passed to `__init__(text)` '
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <class 'NoneType'>
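Going by the last line of the traceback, the failure comes from TextBlob rather than from count() itself: getSubjectivity received None instead of a string. A likely explanation is that show(5) only evaluates the first few rows, while groupBy(...).count() has to run the scoring function over every row, including at least one where the tweet text is null. Below is a minimal sketch of one way to guard against that. The names none_safe and strict_scorer are hypothetical and are not part of the original code; strict_scorer is only a stand-in that rejects None the way the traceback shows TextBlob doing, so the snippet runs without Spark or TextBlob installed.

```python
from typing import Callable, Optional


def none_safe(score_fn: Callable[[str], float],
              default: float = 0.0) -> Callable[[Optional[str]], float]:
    """Wrap a text-scoring function so non-string input (e.g. None) returns `default`."""
    def wrapper(text: Optional[str]) -> float:
        if not isinstance(text, str):
            # Null tweet text: skip the scorer instead of raising TypeError.
            return default
        return score_fn(text)
    return wrapper


def strict_scorer(text: str) -> float:
    """Hypothetical stand-in for a TextBlob-based scorer: raises on non-strings."""
    if not isinstance(text, str):
        raise TypeError("text must be a string, not %s" % type(text))
    return 1.0 if text.strip() else 0.0


safe_scorer = none_safe(strict_scorer)

print(safe_scorer(None))           # default value, no exception raised
print(safe_scorer("hello world"))  # delegates to the wrapped scorer
```

In the real pipeline the same wrapper could be applied to the existing scorer, for example none_safe(lambda t: TextBlob(t).sentiment.subjectivity). Alternatively, rows with null text can be dropped before the RDD step, for example with df.na.drop(subset=["tweet_text"]), assuming the text column already has that name at that point in the pipeline.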