How to loop over multiple files, keeping the base name for further processing?
I have multiple text files that need to be tokenised, POS-tagged, and run through NER. I am using the C&C taggers and have run their tutorial, but I am wondering if there is a way to tag multiple files in one go rather than one by one.
At the moment I am tokenising each file as follows:
bin/tokkie --input working/tutorial/example.txt --quotes delete --output working/tutorial/example.tok
then part-of-speech tagging:
bin/pos --input working/tutorial/example.tok --model models/pos --output working/tutorial/example.pos
and lastly Named Entity Recognition:
bin/ner --input working/tutorial/example.pos --model models/ner --output working/tutorial/example.ner
I am not sure how to write a loop that does this and keeps the file name the same as the input, with only the extension changing to show which tagging stage produced it. I was thinking of a bash script, or perhaps Perl, to open the directory, but I am not sure how to write the C&C commands so that the script understands them.
At the moment I am doing it manually and it's pretty time consuming to say the least!
Comments (2)
Untested, likely needs some directory mangling.
In Bash:
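A minimal sketch of such a loop, not tested against the real C&C binaries. It assumes the inputs end in .txt, that it is run from the C&C directory, and that the tools take exactly the flags shown in the question. The names BIN_DIR, MODEL_DIR and tag_all are made up for illustration. The base name is kept by stripping the .txt suffix with ${txt%.txt} and appending a new extension for each stage.

```bash
#!/usr/bin/env bash
# Sketch: tokenise, POS-tag and NER-tag every .txt file in a directory.
# BIN_DIR and MODEL_DIR are hypothetical variables; the defaults mirror the
# relative paths used in the question.
BIN_DIR="${BIN_DIR:-bin}"
MODEL_DIR="${MODEL_DIR:-models}"

tag_all() {
    local dir="$1" txt base
    for txt in "$dir"/*.txt; do
        [ -e "$txt" ] || continue   # the glob matched nothing
        base="${txt%.txt}"          # drop the extension, keep directory and base name
        # Run the three stages in order; report the file if any stage fails.
        "$BIN_DIR/tokkie" --input "$txt" --quotes delete --output "$base.tok" &&
        "$BIN_DIR/pos" --input "$base.tok" --model "$MODEL_DIR/pos" --output "$base.pos" &&
        "$BIN_DIR/ner" --input "$base.pos" --model "$MODEL_DIR/ner" --output "$base.ner" ||
        echo "failed on $txt" >&2
    done
}
```

After sourcing the script, calling tag_all working/tutorial should leave example.tok, example.pos and example.ner next to example.txt. Quoting "$txt" and "$base" keeps file names containing spaces intact.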