PySpark: find a substring as a whole word
I would like to check whether a string column is contained in another column as a whole word. There are a few approaches, such as using contains as described here or using array_contains as described here.
The first approach fails in the following edge case:
+---------+-----------------------+
|candidate| sentence |
+---------+-----------------------+
| su |We saw the survivors. |
+---------+-----------------------+
su should be found as a separate word and not as a pure substring of the sentence column.
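To make the failure concrete, here is a small plain-Python sketch. It is only an analogue: it assumes that Python's `in` operator on strings behaves like the substring test that contains performs, and the variable names are made up for illustration.

```python
candidate = "su"
sentence = "We saw the survivors."

# Plain substring test, analogous to what contains does on a column.
substring_hit = candidate in sentence

# Whole-word test on the same data, for comparison.
words = sentence.rstrip(".").split()
whole_word_hit = candidate in words

print(substring_hit, whole_word_hit)  # True False
```

The substring test reports a hit because "survivors" starts with "su", while the whole-word test correctly reports none.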
The second approach fails when the candidate is a compound word. An example is:
+----------------+------------------------+
|candidate | sentence |
+----------------+------------------------+
| Roman emperor | He was a Roman emperor.|
+----------------+------------------------+
The second approach fails here because it turns the sentence column into an array of tokens, [He, was, a, Roman, emperor], and none of them is equal to Roman emperor.
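The same effect can be reproduced in plain Python. This is a sketch that assumes the sentence is split on single spaces, roughly what splitting the column and then calling array_contains would do; the variable names are illustrative.

```python
candidate = "Roman emperor"
sentence = "He was a Roman emperor."

# Tokenize on spaces, dropping the trailing period.
tokens = sentence.rstrip(".").split(" ")

# Membership test over the tokens, analogous to array_contains.
token_hit = candidate in tokens

print(tokens)     # ['He', 'was', 'a', 'Roman', 'emperor']
print(token_hit)  # False
```

A two-word candidate can never equal a single token, so the membership test fails even though the phrase is clearly present.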
Is there any way to resolve this issue?
Comments (1)
This probably still has edge cases, but I hope it gives you some ideas.
I would use regexp_extract to match the candidate against the sentence. First, I convert the candidate to a regex (i.e., convert each space to \s), then use regexp_extract with word boundaries (\b).
Result
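What follows is not the answerer's original code but a minimal sketch of the approach described above. The helper names (`candidate_to_pattern`, `find_whole_word`, `add_match_column`) and the column names `candidate_regex`, `match` and `found` are hypothetical. The first two helpers show the logic with Python's `re` module; the third applies the same idea in PySpark, assuming a DataFrame with string columns `candidate` and `sentence`.

```python
import re


def candidate_to_pattern(candidate: str) -> str:
    """Build a regex from a candidate: each space becomes a whitespace
    class, and the whole thing is wrapped in word boundaries."""
    return r"\b" + candidate.strip().replace(" ", r"\s") + r"\b"


def find_whole_word(candidate: str, sentence: str) -> str:
    """Return the matched text, or '' when there is no whole-word match
    (the same convention regexp_extract uses)."""
    match = re.search(candidate_to_pattern(candidate), sentence)
    return match.group(0) if match else ""


def add_match_column(df):
    """PySpark version of the same idea (assumed column names)."""
    # Imported lazily so the helpers above run without Spark installed.
    from pyspark.sql import functions as F

    candidate_regex = F.concat(
        F.lit(r"\b"),
        # The replacement is a Java regex replacement string, so the
        # backslash is doubled to end up with a literal \s.
        F.regexp_replace(F.trim(F.col("candidate")), " ", r"\\s"),
        F.lit(r"\b"),
    )
    return (
        df.withColumn("candidate_regex", candidate_regex)
        # expr() lets the pattern come from a column instead of a literal.
        .withColumn("match", F.expr("regexp_extract(sentence, candidate_regex, 0)"))
        .withColumn("found", F.col("match") != "")
    )


print(repr(find_whole_word("su", "We saw the survivors.")))
print(repr(find_whole_word("Roman emperor", "He was a Roman emperor.")))
```

With the two examples from the question, the first call prints an empty string and the second prints 'Roman emperor'. One of the remaining edge cases: a candidate that contains regex metacharacters such as `.` or `(` would need to be escaped first, which this sketch does not do.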