BiGrams in Spark using Java

Posted on 2025-01-09 04:58:33

I already have the sentences in an RDD, and the output looks like this:

RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the
EU. It's in the rules. #Eurovision2018 RT @Mystificus: Of course I'll
watch #eurovision tonight. After all, 200 million people can't be
wrong, can they? Er...????????... RT @KlNGNEUER: Me when Europeans make
fun of Eurovision VS Me when Americans make fun of Eurovision

#Eurovision #EuroSemi2 Eurovision song contest 2018 tonight!!!!!! Saturday chills with bae, hands up who’s not watching
Eurovision… @AndrewDawes71 @SuzanneEvans1
@ConstantinStHe1 The tweet was directed at citizens of other countries
partaking in t… Looking forward to @Eurovision
@bbceurovision tonight and rooting for @surieofficial who has strong
competition. Sh… RT @Jem_Collins: Media and
journalism friends, I need you to do something during #Eurovision this
evening. And that something is to drink a… Getting ready for anime AND
Eurovision with friends tonight! ????

But when I try to split it by "." and ",", I only get an empty text file using this code:

JavaRDD<String> sentences= lines.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());

Where lines is an RDD with the content of the screenshot.

After that, how can I construct the bigrams?

REPRODUCIBLE EXAMPLE:

SparkConf conf = new SparkConf().setAppName("BiGramsApp");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> inputFile = sparkContext.textFile(input);
JavaRDD<String> sentences = inputFile.flatMap(  line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
    
words.saveAsTextFile(outputDir);

The input file can be any .txt file containing sentences; you can try it with the strings shown at the beginning.


Comments (1)

暮年慕年 2025-01-16 04:58:33

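The fix for the split is to wrap the delimiter in a character class, "[.]" or "[ ]". String.split takes a regular expression, and an unescaped "." matches any character, so split(".") treats every character as a delimiter and returns an empty array.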
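
The question also asks how to build the bigrams once the split works. Below is a minimal sketch in plain Java, not a definitive implementation. The class name BigramSketch and the helpers sentences and bigrams are hypothetical names introduced here for illustration. It assumes a bigram is a pair of adjacent whitespace-separated words within one sentence, and that sentences end at a literal ".".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative, not from the original post.
class BigramSketch {

    // Split text into sentences on a literal '.'.
    // "[.]" is a regex character class, so the dot is not the "any character" wildcard.
    public static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("[.]")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Pair each word with the word that follows it (assumed definition of a bigram).
    public static List<String> bigrams(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "It's in the rules. Of course I'll watch.";

        // The original bug: "." as a regex matches every character,
        // so all pieces are empty and split drops them.
        System.out.println(Arrays.toString(line.split(".")));

        // Bigrams are built per sentence, so no pair crosses a sentence boundary.
        for (String s : sentences(line)) {
            System.out.println(bigrams(s));
        }
    }
}
```

With the RDD from the question, the same helpers could be plugged into flatMap in the style the question already uses, for example lines.flatMap(l -> BigramSketch.sentences(l).iterator()) followed by flatMap(s -> BigramSketch.bigrams(s).iterator()). Note that textFile produces one element per line, so a sentence that wraps across two lines of the input would be cut at the line break under this approach.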
