Natural language processing / text structure analysis: starting points
I need to parse and process a large set of semi-structured text (basically legal documents: law texts, their addenda, treaties, judges' decisions, ...). The most fundamental thing I'm trying to do is extract information on how the subparts are structured (chapters, articles, subheadings, ...) plus some metadata. My question is whether anyone can point me to starting points for this type of text processing. I'm sure there has been a lot of research into this, but what I find is mostly about either parsing something with a strict grammar (like code) or completely free-form text (like Google tries to do on web pages). I think if I got hold of the right keywords, I would have more success in Google and my journal databases. Thanks.
Comments (2)
The Natural Language Toolkit (NLTK) may be an interesting start, and it has plenty of resources on all areas of natural language processing. It is probably more linguistically focused than you need.
The other option is to go for a parser generator library (normally used for code) that is not so strict, i.e. one that allows you to ignore big chunks of text if needed. In Python I would recommend pyparsing. In another answer I showed a simple example of what it can do when you want to ignore arbitrary chunks of text.
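To make the "ignore big chunks of text" idea concrete, here is a minimal sketch. It deliberately uses the standard library's `re` module rather than pyparsing, just to stay dependency-free; the principle is the same: describe only the structural markers and skip everything else. The heading keywords (`Chapter`, `Article`, `Section`), the numbering styles, and the sample text are assumptions for illustration, not taken from any real corpus.

```python
import re

# Hypothetical heading format: a keyword, a number (Arabic or Roman), then a title.
# Real legal documents will need their own patterns.
HEADING = re.compile(
    r"^(?P<kind>Chapter|Article|Section)\s+"
    r"(?P<number>[0-9]+|[IVXLC]+)\b\.?\s*"
    r"(?P<title>.*)$",
    re.MULTILINE,
)


def extract_outline(text):
    """Return (kind, number, title) for each heading line; all other text is skipped."""
    return [
        (m.group("kind"), m.group("number"), m.group("title").strip())
        for m in HEADING.finditer(text)
    ]


# Made-up sample document, for illustration only.
SAMPLE = """\
Chapter I General provisions
Some preamble text that the extractor ignores.
Article 1 Scope
This law applies to ...
Article 2 Definitions
For the purposes of this law ...
"""

if __name__ == "__main__":
    for kind, number, title in extract_outline(SAMPLE):
        print(kind, number, title)
```

A parsing library such as pyparsing would probably become worthwhile once headings nest, span several lines, or carry metadata that a single regular expression cannot express cleanly.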
I've never done this before, but if I were going to, I'd definitely look into ANTLR. It's a pretty popular project and could very well have a port in your language of choice.