Split a list of names into "FirstName {TAB} Lastname" pairs
Is there a specific library, algorithm or technique (besides using regular expressions)
to use if you want to convert/translate the following lines?
"Acme Corporation Inc., John, Doe, F."
"Smith, Allen, Smith,Susan"
"Marshall, J., L., Johnson, H., Caruso, D., Jones, J."
"Stein, Harry, Joan, and Mike"
These lines should be converted into text containing:
Acme {TAB} Corporation
Doe {TAB} John
Smith {TAB} Allen
Smith {TAB} Susan
Marshall {TAB} J.
Johnson {TAB} H.
Caruso {TAB} D.
Jones {TAB} J.
Stein {TAB} Harry
Stein {TAB} Joan
Stein {TAB} Mike
The original text contains only proper names and middle initials (D. or J.) except for
an occasional "and" separating siblings with the same last name as in the last line
of original text above.
Also, is this considered to be "Named Entity Recognition" or is there some other technical
name for this process?
Ideally, I would like code or algorithms in a language like Ruby/Python/Perl/PHP that could
make this translation.
Any ideas? Thanks in advance.
Comments (1)
This works, almost:
Actual output for given sample input
Discussion
I employed the following heuristics:
In the end I used "regular expressions" only to split corporation names on space; this could be trivially replaced with a non-regex version.
Even with all of this I still get "John Doe" wrong, because its names are reversed in the input. I couldn't devise a reliable way to detect this.
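For illustration, here is a minimal Python sketch of one way such heuristics could look. It is not the code from this answer; the rules and the names `parse_line` and `is_initial` are assumptions made for the example: a comma-separated token containing a space is treated as a corporation, a two-character token like "J." is treated as an initial, a non-initial token that follows a first name starts a new surname, and a line containing "and" is treated as a sibling list sharing the first token as surname. Splitting is done with plain `str.split`, so no regular expressions are needed.

```python
def is_initial(token):
    """Return True for a single-letter initial such as 'J.'."""
    return len(token) == 2 and token[0].isalpha() and token[1] == "."


def parse_line(line):
    """Return a list of (left, right) pairs for one comma-separated line."""
    raw = [t.strip() for t in line.split(",") if t.strip()]
    has_and = any(t.lower().startswith("and ") for t in raw)
    # Drop a leading "and " so "and Mike" becomes "Mike".
    tokens = [t[4:].strip() if t.lower().startswith("and ") else t for t in raw]

    if has_and:
        # Assumed rule: "and" marks siblings who share the first token as surname.
        return [(tokens[0], first) for first in tokens[1:]]

    pairs = []
    last, has_first = None, False
    for tok in tokens:
        words = tok.split()
        if len(words) > 1:
            # Assumed rule: a multi-word token is a corporation name;
            # keep its first two words as the pair.
            pairs.append((words[0], words[1]))
            last, has_first = None, False
        elif last is None:
            # First token of a new person: take it as the surname.
            last, has_first = tok, False
        elif not has_first:
            # The token right after a surname is the first name (or first initial).
            pairs.append((last, tok))
            has_first = True
        elif is_initial(tok):
            # An extra initial after the first name is a middle initial; skip it.
            continue
        else:
            # Any other token after a complete pair starts a new surname.
            last, has_first = tok, False
    return pairs


if __name__ == "__main__":
    sample = [
        "Acme Corporation Inc., John, Doe, F.",
        "Smith, Allen, Smith,Susan",
        "Marshall, J., L., Johnson, H., Caruso, D., Jones, J.",
        "Stein, Harry, Joan, and Mike",
    ]
    for line in sample:
        for left, right in parse_line(line):
            print(f"{left}\t{right}")
```

Like the approach described above, this sketch still emits "John {TAB} Doe" rather than "Doe {TAB} John" for the first sample line, because nothing in the input signals that the name order is reversed there.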