How to compute the similarity between two sentences (syntactic and semantic)

Posted 2024-09-17 22:42:51

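I need to take two sentences at a time and decide whether they are similar. By similar I mean both syntactically and semantically.

INPUT1: Obama signs the law. A new law is signed by Obama.
INPUT2: A Bus is stopped here. A vehicle stops here.
INPUT3: Fire in NY. NY is burnt down.
INPUT4: Fire in NY. 50 died in NY fire.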

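I don't want to rely on an ontology tree as the sole solution. I wrote code that computes the Levenshtein distance (LD) between the two sentences and then decides whether the second sentence:

can be ignored (INPUT1 and INPUT2),
should replace the first sentence (INPUT3), or
should be stored along with the first sentence (INPUT4).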

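I'm not happy with the code, because LD only captures surface (syntactic) differences. What other methods are there? And how can semantics be incorporated, for example the fact that a bus is a kind of vehicle?

Here is the code: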

%# Once the distance is computed, a decision is made on whether the new event
%# (string 2) is ignored, replaces the existing event (string 1), or is
%# stored separately. The higher the LD, the greater the difference between
%# the two strings: a low distance indicates identical or similar events,
%# while a high distance indicates that the new event is a fresh event.

%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
L1=length(str1)+1;
L2=length(str2)+1;
L=zeros(L1,L2);   %# DP table: L(i,j) is the LD between str1(1:i-1) and str2(1:j-1).

g=+1;             %# gap cost (insertion or deletion)
m=+0;             %# match cost: free, since we seek to minimize
d=+1;             %# mismatch (substitution) cost

%# Boundary conditions: distance from the empty string.
L(:,1)=((0:L1-1)*g)';
L(1,:)=(0:L2-1)*g;

%# Calculating required edits.
for idx=2:L1
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score;   %# match or substitute
        m2=L(idx-1,idy) + g;         %# delete a character of str1
        m3=L(idx,idy-1) + g;         %# insert a character of str2
        L(idx,idy)=min([m1,m2,m3]);  %# keep the cheapest of the three edits
    end
end
%# The LD between two strings.
D=L(L1,L2);

%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4)          %# Distance is so small that string 2 seems identical to string 1.
    store=str1;    %# Hence string 2 is ignored. String 1 remains stored.
elseif (D<=15)     %# Distance is too large for the strings to be identical, but
                   %# not large enough to make string 2 a separate event.
    store=str2;    %# String 2 is somewhat similar to string 1,
                   %# so string 1 is replaced with string 2.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end
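
One purely syntactic refinement, offered only as a sketch: the fixed thresholds above (4 and 15) depend on how long the sentences are, and a character-level distance counts edits inside a word the same way as edits that change the word entirely. Comparing whole words and dividing by the length of the longer sentence gives a score between 0 and 1 that is easier to threshold. The function name `wordLevenshtein`, the tokenization rule, and the normalization are assumptions made for illustration; none of them come from the original code.

```matlab
% Sketch: word-level Levenshtein distance, normalized by sentence length.
% The function name, tokenizer and normalization are illustrative
% assumptions. Save as wordLevenshtein.m.
function nd = wordLevenshtein(s1, s2)
    w1 = tokenize(s1);
    w2 = tokenize(s2);
    n1 = numel(w1);
    n2 = numel(w2);

    % T(i+1, j+1) is the distance between the first i words of s1
    % and the first j words of s2.
    T = zeros(n1 + 1, n2 + 1);
    T(:, 1) = (0:n1)';
    T(1, :) = 0:n2;

    for i = 1:n1
        for j = 1:n2
            cost = ~strcmp(w1{i}, w2{j});           % 0 on match, 1 otherwise
            T(i+1, j+1) = min([T(i, j) + cost, ...  % match or substitute
                               T(i, j+1) + 1, ...   % delete a word of s1
                               T(i+1, j) + 1]);     % insert a word of s2
        end
    end

    % Divide by the longer sentence so the result lies in [0, 1].
    nd = T(end, end) / max([n1, n2, 1]);
end

function words = tokenize(s)
    % Lowercase, then keep runs of letters and digits as words.
    words = regexp(lower(s), '[a-z0-9]+', 'match');
end
```

For example, `wordLevenshtein('A bus stops here.', 'A vehicle stops here.')` gives 0.25 (one substituted word out of four). This is still a surface measure: it has no idea that a bus is a vehicle, and a change in word order such as the active/passive pair in INPUT1 still produces a large distance.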

Any help is appreciated.



栀梦 2024-09-24 22:42:52

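"Semantically": there is no simple textbook algorithm for that. Natural language (especially English) is a very complicated and fickle beast. Let's look at (just a small part of) the provided cases: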

INPUT1: Obama signs the law. A new law is signed by Obama.

Need to know that signing a law is what makes it a 'new' law.

INPUT2: A Bus is stopped here. A vehicle stops here.

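Need to know that a bus is a type of vehicle, as well as some sort of time relation. Also, what if the bus did stop but does not normally stop, or is no longer stopped? The sentence can be read several ways.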

INPUT3: Fire in NY. NY is burnt down.

Need to know that fires can burn things down.

INPUT4: Fire in NY. 50 died in NY fire.

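Need to know that fires can kill things (see next). Also need to associate the "news headline" shorthand (50 WHAT?) with people. The brain can do this somewhat trivially. Computer programs are not brains.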
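
To make the "a bus is a type of vehicle" point concrete, here is a minimal sketch that rewrites each word to a broader or base form using a tiny hand-written lookup table and then measures word-set overlap (Jaccard similarity). The table, the function name `conceptOverlap`, and the choice of Jaccard are all assumptions for illustration. A real system would need a lexical resource such as WordNet in place of the toy table, and it would still miss the time-relation and world-knowledge issues in the cases above.

```matlab
% Sketch: word-set overlap after mapping words to a broader or base form.
% The lookup table is a hand-written toy assumption, not a real ontology.
% Save as conceptOverlap.m.
function sim = conceptOverlap(s1, s2)
    % Hypothetical table: kinds of vehicle map to 'vehicle', and a few
    % inflected verbs map to their base form.
    broader = containers.Map( ...
        {'bus', 'car', 'stops', 'stopped', 'signs', 'signed'}, ...
        {'vehicle', 'vehicle', 'stop', 'stop', 'sign', 'sign'});

    c1 = unique(generalize(s1, broader));
    c2 = unique(generalize(s2, broader));

    % Jaccard similarity: shared concepts divided by all distinct concepts.
    nUnion = numel(union(c1, c2));
    if nUnion == 0
        sim = 1;    % two empty sentences are treated as identical
    else
        sim = numel(intersect(c1, c2)) / nUnion;
    end
end

function words = generalize(s, broader)
    % Lowercase, split into words, then replace any word found in the table.
    words = regexp(lower(s), '[a-z0-9]+', 'match');
    for k = 1:numel(words)
        if isKey(broader, words{k})
            words{k} = broader(words{k});
        end
    end
end
```

With this table, `conceptOverlap('A Bus is stopped here.', 'A vehicle stops here.')` evaluates to 0.8, because four of the five distinct concepts are shared. Every such fact has to be supplied by hand or by an external resource.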

And I'm no English major :-)
