Splitting a sentence into words in C#, but having trouble with punctuation
I have seen a few similar questions but I am trying to achieve this.
Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!"
I want to extract the words and store them in an array.
The expected array elements would be this.
the
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
earth
I tried using String.Split( ','\t','\r') but this does not work correctly. I also tried removing the "." and other punctuation marks, but I would want a string like "i.e." to be parsed out too. What is the best way to achieve this?
I also tried using Regex.Split, to no avail.
string[] words = Regex.Split(line, @"\W+");
Would surely appreciate some nudges in the right direction.
Comments (4)
A regex solution.
And if you really want to fix that last "." on "i.e.", you could use this.
Here's the code I'm using.
Results:
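The pattern and output from this answer are not reproduced here. As a minimal sketch of how a regex approach could work (an assumption, not necessarily the answerer's pattern), one option is to match tokens instead of splitting on separators: try a dotted abbreviation first, then fall back to a plain run of word characters. The class name `RegexWordExtractor` and the pattern itself are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordExtractor
{
    // First alternative: two or more "letter + period" pairs, e.g. "i.e." or "e.g."
    // Second alternative: an ordinary run of word characters.
    private static readonly Regex Token = new Regex(@"(?:[A-Za-z]\.){2,}|\w+");

    public static string[] ExtractWords(string text)
    {
        return Token.Matches(text)
                    .Cast<Match>()                              // MatchCollection -> IEnumerable<Match>
                    .Select(m => m.Value.ToLowerInvariant())    // the expected output is lowercase
                    .ToArray();
    }
}
```

Calling `RegexWordExtractor.ExtractWords("The moon is our natural satellite, i.e. it rotates around the Earth!")` should yield the twelve elements listed in the question, with "i.e." kept whole. Note that this sketch cannot tell whether the final period of an abbreviation also ends the sentence, and it splits something like "U.S.A" into "u.s." and "a".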
I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?
Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
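The dictionary idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the whitelist contents, the class name `DictionaryTokenizer`, and the rule of re-attaching a single trailing period are all hypothetical choices, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryTokenizer
{
    // Hypothetical whitelist of "words that contain punctuation"; extend as needed.
    private static readonly HashSet<string> KeepAsIs =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "i.e.", "e.g.", "etc." };

    public static string[] Tokenize(string text)
    {
        var words = new List<string>();
        // A null separator array makes Split break on any whitespace.
        foreach (var raw in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            // Trim everything that is not a letter or digit from both ends.
            int start = 0, end = raw.Length;
            while (start < end && !char.IsLetterOrDigit(raw[start])) start++;
            while (end > start && !char.IsLetterOrDigit(raw[end - 1])) end--;
            if (start == end) continue; // token was punctuation only

            string core = raw.Substring(start, end - start);
            // Put one trailing period back only if that forms a whitelisted word.
            bool hadDot = end < raw.Length && raw[end] == '.';
            string dotted = core + ".";
            string word = hadDot && KeepAsIs.Contains(dotted) ? dotted : core;
            words.Add(word.ToLowerInvariant());
        }
        return words.ToArray();
    }
}
```

With this sketch, "satellite," becomes "satellite" and "Earth!" becomes "earth", while "i.e." survives because it is in the whitelist. A period that both closes an abbreviation and ends the sentence is still kept with the word, which is the ambiguity the answer points out.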
This works for me.
Results:
You could do some post-processing of the results, removing commas and semicolons, etc.
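The code for this answer is not shown, so the following is only one plausible reading of "split, then post-process": split on whitespace and trim a fixed set of punctuation from the edges of each token. The class name `SimpleSplitter` and the exact character set are assumptions made for illustration.

```csharp
using System;
using System.Linq;

public static class SimpleSplitter
{
    // Punctuation stripped from token edges. '.' is deliberately absent so "i.e." survives.
    private static readonly char[] EdgePunctuation = { ',', ';', ':', '!', '?', '"', '(', ')' };

    public static string[] SplitWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // split on whitespace
                   .Select(t => t.Trim(EdgePunctuation).ToLowerInvariant())    // post-process each token
                   .Where(t => t.Length > 0)                                   // drop punctuation-only tokens
                   .ToArray();
    }
}
```

For the sentence in the question this gives the expected twelve words. The trade-off of leaving '.' out of the trim set is that a word ending a sentence with a period, such as "earth.", would keep that period.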