Splitting a sentence into words, but having trouble with punctuation in C#

Posted on 2024-12-03 10:33:06

I have seen a few similar questions, but I am still struggling to achieve the following.

Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!"
I want to extract the words and store them in an array.
The expected array elements would be this.

the 
moon 
is 
our 
natural 
satellite 
i.e. 
it  
rotates 
around 
the 
earth

I tried using String.Split( ','\t','\r') but this does not work correctly. I also tried removing the '.', ',' and other punctuation marks, but I want a string like "i.e." to be parsed out intact too. What is the best way to achieve this?
I also tried using Regex.Split, to no avail.

string[] words = Regex.Split(line, @"\W+");

Would surely appreciate some nudges in the right direction.

Comments (4)

或十年 2024-12-10 10:33:06

A regex solution.

(\b[^\s]+\b)

And if you really want to keep that trailing period on "i.e.", you could use this instead.

((\b[^\s]+\b)((?<=\.\w).)?)

Here's the code I'm using.

  // requires: using System; using System.Text.RegularExpressions;
  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");

  foreach (Match match in matches)
  {
     Console.WriteLine(match.Value);
  }

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
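
For reuse, the pattern above can be wrapped in a small helper. This is only a sketch: the names WordExtractor and ExtractWords are made up for illustration, the lowercasing step is an assumption based on the lowercase output listed in the question, and the final `.` of the pattern is escaped here so that only a literal period can be appended (a small change from the pattern as posted).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // A run of non-whitespace bounded by word boundaries, optionally
    // followed by a literal '.' when the two characters before it are
    // a '.' and a word character (this keeps the last '.' of "i.e.").
    private static readonly Regex WordPattern =
        new Regex(@"\b[^\s]+\b((?<=\.\w)\.)?");

    public static List<string> ExtractWords(string input)
    {
        var words = new List<string>();
        foreach (Match match in WordPattern.Matches(input))
        {
            // Assumption: lowercase to mirror the question's expected output.
            words.Add(match.Value.ToLowerInvariant());
        }
        return words;
    }
}
```

Calling WordExtractor.ExtractWords(input) on the sample sentence should give the twelve words from the question, with "i.e." kept in one piece.
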
请爱~陌生人 2024-12-10 10:33:06

I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?

Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
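
One way the dictionary idea might be sketched: split on whitespace, keep a token untouched if it appears in a set of known punctuated words, and otherwise strip punctuation from both ends. The class name, method names and the three entries in the set are illustrative assumptions, not a complete list.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryTokenizer
{
    // Hypothetical whitelist of "words that contain punctuation".
    private static readonly HashSet<string> PunctuatedWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "i.e.", "e.g.", "etc."
        };

    public static List<string> Tokenize(string sentence)
    {
        var words = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var token in sentence.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (PunctuatedWords.Contains(token))
            {
                // Known abbreviation: keep its punctuation.
                words.Add(token);
                continue;
            }
            // Ordinary word: strip leading and trailing punctuation.
            var trimmed = TrimPunctuation(token);
            if (trimmed.Length > 0) words.Add(trimmed);
        }
        return words;
    }

    private static string TrimPunctuation(string token)
    {
        int start = 0, end = token.Length;
        while (start < end && char.IsPunctuation(token[start])) start++;
        while (end > start && char.IsPunctuation(token[end - 1])) end--;
        return token.Substring(start, end - start);
    }
}
```

This sketch only recognises exact matches, so a token such as "i.e.," (with a trailing comma) would be trimmed down to "i.e"; and when "etc." ends a sentence the period is kept as part of the word, which is the "it may do both" ambiguity mentioned above.
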

寻梦旅人 2024-12-10 10:33:06

This works for me.

var str="The moon is our natural satellite, i.e. it rotates around the Earth!";
var a = str.Split(new char[] {' ', '\t'});
for (int i=0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}

Results:

 -The
 -moon
 -is
 -our
 -natural
 -satellite,
 -i.e.
 -it
 -rotates
 -around
 -the
 -Earth!

You could then do some post-processing of the results, removing commas, semicolons, etc.
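
A minimal sketch of that post-processing step, assuming LINQ is available. It trims a fixed set of punctuation from the ends of each token but deliberately leaves '.' alone so that "i.e." survives; the trade-off is that a sentence-final word such as "Earth." would keep its period. The SplitAndTrim class and Words method are illustrative names.

```csharp
using System;
using System.Linq;

public static class SplitAndTrim
{
    // Punctuation stripped from both ends of each token; '.' is left out on purpose.
    private static readonly char[] Trimmed = { ',', ';', ':', '!', '?', '"', '(', ')' };

    public static string[] Words(string str)
    {
        return str
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.Trim(Trimmed))
            .Where(token => token.Length > 0) // drop tokens that were only punctuation
            .ToArray();
    }
}
```
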

你的他你的她 2024-12-10 10:33:06
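// Needs System.Linq and System.Text.RegularExpressions; note that \w+ splits "i.e." into "i" and "e".
string[] words = Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value).ToArray();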