Assistance with a find-and-replace regular expression
I have a text file, and each line is of the form:
TAB WORD TAB PoS TAB FREQ#
Word PoS Freq
the Det 61847
of Prep 29391
and Conj 26817
a Det 21626
in Prep 18214
to Inf 16284
it Pron 10875
is Verb 9982
to Prep 9343
was Verb 9236
I Pron 8875
for Prep 8412
that Conj 7308
you Pron 6954
Would one of you regex wizards kindly help me isolate the WORDS from the file? I plan to do a find and replace in TextPad, and hopefully that will be that. Multiple find-and-replace passes are fine. One thing: note that searching for "verb" would also match the WORD "verb", not just the part of speech, so be careful. In the end I want 1 word per line.
Thanks so much!
Comments (4)
I think Microsoft Excel can handle this better.
Just paste the whole text into Excel and it will be laid out as a table; then select the cells in the word column and copy them into Notepad.
I bet this is the easiest path.
If Excel instead puts each whole line into a single cell, extract the word in a separate column with the formula below, where maxchar is a placeholder for the number of leading characters to keep:
=TRIM(LEFT(C1,maxchar))
You could just use awk to print only the first column, skipping the first (header) line.
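The commands from this answer did not survive on this page, so the following is only a sketch of what such an awk call might look like, not the answerer's exact code. It assumes the fields are separated by tabs or spaces, and the sample rows are copied from the question.

```shell
# Sample input in the question's layout: TAB WORD TAB PoS TAB FREQ
# (assumption: the columns really are tab-separated).
sample=$(printf '\tWord\tPoS\tFreq\n\tthe\tDet\t61847\n\tof\tPrep\t29391\n\tto\tInf\t16284\n')

# By default awk splits on runs of whitespace and ignores leading whitespace,
# so $1 is the word even though each line starts with a TAB.
# NR > 1 skips the header line.
words=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $1 }')

printf '%s\n' "$words"
```

On a real file the same program would be run as awk 'NR > 1 { print $1 }' words.txt > words_only.txt, where both file names are hypothetical.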
There's not really any need to use a regular expression for this. For example, you can use cut.
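The original command is missing from this page; a minimal sketch of a cut-based approach, assuming every line begins with a TAB as the question describes, might be:

```shell
# Sample input in the question's layout (assumption: tab-separated, leading TAB).
sample=$(printf '\tWord\tPoS\tFreq\n\tthe\tDet\t61847\n\tof\tPrep\t29391\n')

# cut uses TAB as its default delimiter. Because each line starts with a TAB,
# field 1 is empty and the word is field 2.
# tail -n +2 drops the header line.
words=$(printf '%s\n' "$sample" | cut -f2 | tail -n +2)

printf '%s\n' "$words"
```

If the real file has no leading TAB, the word would be field 1 and -f1 would be the option to use instead.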
Something like
^\s*([a-zA-Z]+)\s+([a-zA-Z]+)
would return the word and PoS as groups. You can then use them in the replacement as $1 and $2 to output whatever you want. If you only want the WORD part, use just $1 in the replacement, and append .* to the pattern so the rest of the line (the frequency) is consumed by the match as well.
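To try the capture-group idea outside TextPad, here is a rough sed equivalent. It is a stand-in, not TextPad syntax: sed -E uses POSIX extended regular expressions, so [[:space:]] replaces \s and the first group is written \1 rather than $1. The row with the word "verb" is made up to illustrate the asker's concern that the word and the part of speech can collide.

```shell
# Sample rows; the "verb Noun 12" row is invented for illustration only.
sample=$(printf '\tthe\tDet\t61847\n\tverb\tNoun\t12\n\tis\tVerb\t9982\n')

# Group 1 is the word, group 2 the part of speech. The trailing .* consumes
# the frequency, so replacing the whole match with \1 leaves only the word.
words=$(printf '%s\n' "$sample" |
  sed -E 's/^[[:space:]]*([A-Za-z]+)[[:space:]]+([A-Za-z]+).*$/\1/')

printf '%s\n' "$words"
```

Because the pattern is anchored at the start of the line, the word "verb" is kept as a word while the Verb tag in the second column is discarded with the rest of the line.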