BiGrams in Spark using Java
I already have the sentences in an RDD, and the output looks like this:
RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the
EU. It's in the rules. #Eurovision2018 RT @Mystificus: Of course I'll
watch #eurovision tonight. After all, 200 million people can't be
wrong, can they? Er...????????... RT @KlNGNEUER: Me when Europeans make
fun of Eurovision VS Me when Americans make fun of Eurovision#Eurovision #EuroSemi2 Eurovision song contest 2018 tonight!!!!!! Saturday chills with bae, hands up who’s not watching
Eurovision… @AndrewDawes71 @SuzanneEvans1
@ConstantinStHe1 The tweet was directed at citizens of other countries
partaking in t… Looking forward to @Eurovision
@bbceurovision tonight and rooting for @surieofficial who has strong
competition. Sh… RT @Jem_Collins: Media and
journalism friends, I need you to do something during #Eurovision this
evening. And that something is to drink a… Getting ready for anime AND
Eurovision with friends tonight! ????
But when I try to split it on "." and then on " ", I only get an empty text file using this code:
JavaRDD<String> sentences= lines.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
Where lines is an RDD with the content of the screenshot.
After that, how can I construct the bigrams?
REPRODUCIBLE EXAMPLE:
SparkConf conf = new SparkConf().setAppName("BiGramsApp");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> inputFile = sparkContext.textFile(input);
JavaRDD<String> sentences = inputFile.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
words.saveAsTextFile(outputDir);
The input file can be any .txt file containing sentences, but you can try it with the strings shown at the beginning.
Comments (1)
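The solution is to put the delimiter inside a character class: split on "[.]" (and likewise "[ ]" for spaces) instead of ".". String.split takes a regular expression, and an unescaped "." matches any character, so every character in the line is treated as a delimiter. Only empty strings remain, split discards them, and the output file ends up empty.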
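The question also asks how to build the bigrams once the split works. Below is a minimal sketch in plain Java, not a definitive implementation. The class name BigramSketch and the helper methods sentences and bigrams are hypothetical names introduced for illustration. It assumes a bigram is a pair of adjacent whitespace-separated words, and that pairs should not cross a sentence boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question or of Spark.
public class BigramSketch {

    // Split a line into sentences on a literal dot. "[.]" is a character
    // class, so the dot is not read as the regex "any character" wildcard.
    public static List<String> sentences(String line) {
        List<String> result = new ArrayList<>();
        for (String s : line.split("[.]")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Build word bigrams for one line, never pairing words across sentences.
    public static List<String> bigrams(String line) {
        List<String> result = new ArrayList<>();
        for (String sentence : sentences(line)) {
            String[] words = sentence.split("\\s+");
            for (int i = 0; i + 1 < words.length; i++) {
                result.add(words[i] + " " + words[i + 1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The original problem: "." as a regex matches everything.
        System.out.println("a.b".split(".").length); // prints 0
        System.out.println(bigrams("It's in the rules. Of course I'll watch"));
        // prints [It's in, in the, the rules, Of course, course I'll, I'll watch]
    }
}
```

With a helper like this on the classpath, the Spark side would presumably reduce to a single call in the style already used in the question, such as inputFile.flatMap(line -> BigramSketch.bigrams(line).iterator()). Note that this only pairs words within one line of the input file, so a sentence that wraps across two lines would lose the bigram spanning the line break.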