Python data structure for Treebank?
I'm looking for a Python data structure that handles the Penn Treebank structure. This is a sample of what the Treebank looks like:
( (S
(NP-SBJ (PRP He) )
(VP (VBD shouted) )
(. .) ))
Essentially, I would like a data structure that I can ask things like "What are the children of the subject NP?" or "What types of phrases dominate the pronoun?", preferably in Python. Does anyone have a clue?
Comments (2)
The NLTK modules might be a good starting point for working with the Penn Treebank and other NLP-related tasks in Python.
I still suggest using NLTK to read the treebank (see e.g. this blog post), but I can imagine it doesn't support this kind of general query.

For "What are the children of the subject NP?", this would be a dict, say children, mapping nonterminals to sets of either nonterminals or child nodes.

For "What types of phrases dominate the pronoun?", this would be another dict, say parents, mapping nonterminals to sets of nonterminals.

You might want to build a relational database of tree nodes. The exact schema would depend on what kind of queries you want to ask, but be sure to check out the Python sqlite3 module.

Alternatively, you can recode the treebank in XML and use XPath to query it. LXML is the best XML/XPath library for Python, IMHO.
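To make the dict idea concrete, here is a minimal standard-library sketch. The helper names parse and index, the nested-tuple tree format, and the extra dominators dict (which records every ancestor label, not just the direct parent) are my own assumptions for illustration; they are not part of NLTK or of the answer above.

```python
import re
from collections import defaultdict

SAMPLE = """
( (S
    (NP-SBJ (PRP He) )
    (VP (VBD shouted) )
    (. .) ))
"""

def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (words) are kept as plain strings. Hypothetical helper.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        # The outermost bracket in Penn Treebank files has no label.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        kids = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                kids.append(node())
            else:
                kids.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, kids)

    tree = node()
    # Drop the empty outer wrapper: "( (S ...) )" becomes "(S ...)".
    if tree[0] == "" and len(tree[1]) == 1:
        tree = tree[1][0]
    return tree

def index(tree, children, parents, dominators, ancestors=()):
    """Fill the three label-keyed dicts by walking the tree once."""
    label, kids = tree
    dominators[label].update(ancestors)
    for kid in kids:
        if isinstance(kid, tuple):
            children[label].add(kid[0])
            parents[kid[0]].add(label)
            index(kid, children, parents, dominators, ancestors + (label,))
        else:
            children[label].add(kid)  # a leaf word

children = defaultdict(set)    # label -> labels/words directly below it
parents = defaultdict(set)     # label -> labels directly above it
dominators = defaultdict(set)  # label -> every label anywhere above it
index(parse(SAMPLE), children, parents, dominators)

print(children["NP-SBJ"])         # children of the subject NP
print(sorted(dominators["PRP"]))  # phrase types dominating the pronoun
```

Because the dicts are keyed by label rather than by node, calling index on every tree in a corpus accumulates answers across the whole treebank. If you need per-sentence answers, you would have to key on node identity instead.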
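The relational route could look roughly like the following sketch with sqlite3. The node table, its columns, and the hand-written rows for the sample sentence are assumptions chosen for illustration; a real schema would depend on the queries you need, as noted above.

```python
import sqlite3

# Rows for the sample tree, written by hand: (id, parent_id, label, word).
NODES = [
    (1, None, "S", None),
    (2, 1, "NP-SBJ", None),
    (3, 2, "PRP", "He"),
    (4, 1, "VP", None),
    (5, 4, "VBD", "shouted"),
    (6, 1, ".", "."),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        label     TEXT NOT NULL,
        word      TEXT
    )
    """
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", NODES)

# "What are the children of the subject NP?"
np_children = [
    row[0]
    for row in conn.execute(
        """
        SELECT c.label
        FROM node AS c JOIN node AS p ON c.parent_id = p.id
        WHERE p.label = 'NP-SBJ'
        ORDER BY c.id
        """
    )
]

# "What types of phrases dominate the pronoun?"
# A recursive CTE walks from each PRP node up to the root.
DOMINATORS_SQL = """
WITH RECURSIVE up(id, parent_id) AS (
    SELECT id, parent_id FROM node WHERE label = ?
    UNION
    SELECT n.id, n.parent_id FROM node AS n JOIN up ON n.id = up.parent_id
)
SELECT DISTINCT n.label
FROM node AS n JOIN up ON n.id = up.parent_id
ORDER BY n.label
"""
pronoun_dominators = [row[0] for row in conn.execute(DOMINATORS_SQL, ("PRP",))]

print(np_children)
print(pronoun_dominators)
```

Storing one row per node with a parent_id keeps direct-child queries to a single join, while ancestor queries need the recursive CTE, which requires a reasonably recent SQLite build.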