PySpark: find a substring as a whole word

Posted on 2025-02-05 12:13:03

I would like to check whether the string in one column is contained in another column as a whole word. There are a few approaches, such as using contains or using array_contains.

The first approach fails in the following edge case:

+---------+-----------------------+
|candidate| sentence              |
+---------+-----------------------+
|  su     |We saw the survivors.  |
+---------+-----------------------+

su should be found as a separate word and not as a pure substring of the sentence column.

The second approach fails when the candidate is a compound word. An example is:

+----------------+------------------------+
|candidate       | sentence               |
+----------------+------------------------+
|  Roman emperor | He was a Roman emperor.|
+----------------+------------------------+

The second approach fails here because it turns the sentence column into an array of tokens, [He, was, a, Roman, emperor], and none of them is equal to Roman emperor.
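
The two failure modes can be reproduced outside Spark. What follows is a minimal plain-Python sketch, not PySpark code: the in operator stands in for contains, and a simple regex tokenizer stands in for whatever tokenization precedes array_contains. The tokenizer and the helper names substring_match and token_match are assumptions made for illustration.

```python
import re

def substring_match(candidate: str, sentence: str) -> bool:
    # Analogue of contains: true for any substring, even inside a longer word.
    return candidate in sentence

def token_match(candidate: str, sentence: str) -> bool:
    # Analogue of array_contains on a tokenized sentence.
    # Assumed tokenizer: each run of word characters is one token.
    tokens = re.findall(r"\w+", sentence)
    return candidate in tokens

rows = [
    ("su", "We saw the survivors."),
    ("Roman emperor", "He was a Roman emperor."),
]
# Each result is (candidate, substring result, token result).
results = [(c, substring_match(c, s), token_match(c, s)) for c, s in rows]
for row in results:
    print(row)
```

For su the substring check gives a false positive, and for Roman emperor the token check gives a false negative, so neither approach handles both rows.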

Is there any way to resolve this issue?

Comments (1)

风居住的梦幻卍 2025-02-12 12:13:03

This probably still has edge cases, but I hope it gives you some ideas.
I would use regexp_extract to match the candidate against the sentence.

First, I convert the candidate to a regex (i.e., replace each space with \s), then use regexp_extract with word boundaries (\b).

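from pyspark.sql import functions as F
# 'regex': candidate with each space replaced by \s; 'match': the whole-word hit, or '' if none
df = (df.withColumn('regex', F.regexp_replace(F.col('candidate'), ' ', r'\\s'))
      .withColumn('match', F.expr(r"regexp_extract(sentence, concat('\\b', regex, '\\b'), 0)")))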

Result

+-------------+-----------------------+--------------+-------------+
|    candidate|               sentence|         regex|        match|
+-------------+-----------------------+--------------+-------------+
|           su|  We saw the survivors.|            su|             |
|Roman emperor|He was a Roman emperor.|Roman\semperor|Roman emperor|
+-------------+-----------------------+--------------+-------------+
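
One of the edge cases mentioned above is a candidate that contains regex metacharacters, such as a period or a parenthesis, which the space-to-\s substitution leaves unescaped. Below is a minimal plain-Python sketch of the same word-boundary idea with escaping added. It uses Python's re module rather than Spark's Java regex engine, so it only illustrates how the pattern is built. The helper name whole_word_match is hypothetical, and joining the parts with \s+ instead of a single \s is an assumption.

```python
import re

def whole_word_match(candidate: str, sentence: str) -> str:
    # Escape each space-separated part so metacharacters match literally,
    # then join the parts with \s+ to allow any run of whitespace between them.
    parts = [re.escape(p) for p in candidate.split()]
    pattern = r"\b" + r"\s+".join(parts) + r"\b"
    m = re.search(pattern, sentence)
    # Return the matched text, or an empty string when there is no match,
    # mirroring what regexp_extract returns.
    return m.group(0) if m else ""

print(repr(whole_word_match("su", "We saw the survivors.")))
print(repr(whole_word_match("Roman emperor", "He was a Roman emperor.")))
print(repr(whole_word_match("a.b", "axb and a.b")))
```

Note that \b only behaves as a whole-word delimiter when the candidate starts and ends with a word character. A candidate such as C++ would need a different boundary condition.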