Natural language parser for parsing sports play data
I'm trying to come up with a parser for football plays. I use the term "natural language" here very loosely so please bear with me as I know little to nothing about this field.
Here are some examples of what I'm working with
(Format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):
04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|
As of now, I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down & distance, offensive team), along with some scripts that fetch this data and sanitize it into the format seen above. Each line gets turned into a "Play" object to be stored in a database.
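For reference, the "easy stuff" for the four fields shown in the sample lines could be handled roughly like this. This is only a sketch: the `Play` dataclass, its field names, and `parse_line` are made up for illustration, and the down & distance pattern assumes every line looks like "1st and 10@NYJ16" (no "and Goal" or similar variants).

```python
import re
from dataclasses import dataclass


@dataclass
class Play:
    clock: str        # game clock as written, e.g. "04:31"
    down: int         # 1 to 4
    distance: int     # yards to go
    spot: str         # field position as written, e.g. "NYJ16"
    offense: str      # team abbreviation, e.g. "NYJ"
    description: str  # free text, parsed separately


# "1st and 10@NYJ16" -> down=1, distance=10, spot="NYJ16"
DOWN_DIST = re.compile(r"(\d)(?:st|nd|rd|th) and (\d+)@(\w+)")


def parse_line(line: str) -> Play:
    # Drop the trailing "|" and split into the four documented fields.
    clock, down_dist, offense, description = line.rstrip("|\n").split("|", 3)
    m = DOWN_DIST.fullmatch(down_dist)
    if m is None:
        raise ValueError(f"unrecognised down & distance: {down_dist!r}")
    return Play(clock, int(m.group(1)), int(m.group(2)), m.group(3),
                offense, description)
```

Raising on an unrecognised down & distance is a deliberate choice here: it makes unexpected formats show up early instead of ending up in the database as half-filled rows.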
The tough part here (for me at least) is parsing the description of the play. Here is some information that I would like to extract from that string:
Example string:
"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."
Result:
turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
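As a rough illustration of how a few of these fields could be pulled out of the example string with plain regular expressions: the patterns below are guesses based only on the sample lines above (names are assumed to be runs of capitalised words, and losses, penalties and scoring plays are not handled at all), and `parse_description` is a hypothetical helper name.

```python
import re

# Assumed name shape: one or more capitalised words, e.g. "LaDainian Tomlinson".
NAME = r"[A-Z][A-Za-z'-]+(?: [A-Z][A-Za-z'-]+)*"

PLAY = re.compile(
    rf"(?P<player>{NAME}) (?P<action>pass|rush)"
    r"(?: (?:to|up) the (?P<direction>left|right|middle))?"
    rf"(?: to (?P<target>{NAME}))?"
    r" for (?P<yards>\d+) yards?"
)
TACKLE = re.compile(rf"Tackled by ({NAME})")


def parse_description(text: str) -> dict:
    result = {
        "passing": False,
        "rushing": False,
        "direction": None,
        "yardage_diff": 0,
        "fumble": "FUMBLE" in text,
        "punt": re.search(r"\bpunts\b", text) is not None,
        "playmakers": [],
    }
    m = PLAY.match(text)
    if m:
        result["passing"] = m.group("action") == "pass"
        result["rushing"] = m.group("action") == "rush"
        result["direction"] = m.group("direction")
        result["yardage_diff"] = int(m.group("yards"))
        # Passer or runner first, then the receiver if there is one.
        result["playmakers"] = [n for n in (m.group("player"), m.group("target")) if n]
    # Defensive playmakers: anyone credited with the tackle.
    result["playmakers"] += TACKLE.findall(text)
    return result
```

Calling `parse_description` on the example string gives `passing=True`, `direction='left'`, `yardage_diff=7` and the three playmakers in the order listed above; the remaining flags in the list would need their own patterns.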
The logic that I had for my initial parser went something like this:
# pass, rush or kick
# gain or loss of yards
# scoring play
# Who scored? off or def?
# TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
# return yards?
# penalty?
# def or off?
# turnover?
# INT, fumble, to on downs?
# off play makers
# def play makers
The descriptions can get pretty hairy (multiple fumbles and recoveries with penalties, etc.), and I was wondering if I could take advantage of some of the NLP modules out there. Chances are I'm going to spend a few days on a dumb, static, state-machine-like parser instead, but if anyone has suggestions on how to approach it using NLP techniques, I'd like to hear about them.
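One way the "dumb" approach might cope with the hairier descriptions is to split each description into sentences and classify every sentence on its own, so a play becomes an ordered list of events rather than one big pattern. The sketch below assumes sentences end with a period followed by whitespace (which would break on names like "T.J."), and the rule names and `events` function are invented for illustration.

```python
import re

# Ordered rules: the first pattern that matches a sentence wins.
RULES = [
    ("fumble", re.compile(r"FUMBLE, recovered by (?P<team>\w+)(?: \((?P<player>[^)]+)\))?")),
    ("tackle", re.compile(r"Tackled by (?P<player>[^.]+)")),
    ("punt", re.compile(r"(?P<player>.+?) punts for (?P<yards>\d+) yards?")),
    ("pass", re.compile(r"(?P<player>.+?) pass .*?for (?P<yards>\d+) yards?")),
    ("rush", re.compile(r"(?P<player>.+?) rush .*?for (?P<yards>\d+) yards?")),
]


def events(description: str) -> list:
    """Return one (event_name, fields) tuple per sentence, in order."""
    out = []
    for sentence in re.split(r"(?<=\.)\s+", description.strip()):
        for name, pattern in RULES:
            m = pattern.match(sentence)
            if m:
                # Keep only the groups that actually matched.
                fields = {k: v for k, v in m.groupdict().items() if v is not None}
                out.append((name, fields))
                break
        else:
            # Nothing matched: keep the raw text so new rules can be added later.
            out.append(("unknown", {"text": sentence}))
    return out
```

Flags such as turnover or fumble could then be derived by walking the event list in order, and sentences that land in "unknown" show which rules are still missing.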
Comments (2)
I think pyparsing would be very useful here.
Your input text looks very regular (unlike real natural language), and pyparsing is great at this stuff. You should have a look at it.
For example, to parse sentences like your play descriptions, you would define a pattern for the sentence (look up the exact syntax in the docs), and pyparsing would break strings apart using this pattern. It also returns a dictionary-like result with the items name, action and distance extracted from the sentence.
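The kind of pattern described above might look something like the following. This is a sketch rather than the answer's original code: it needs the third-party pyparsing package installed, the name regex and the list of actions are assumptions taken from the sample plays, and `parse_play` is a hypothetical helper name.

```python
from pyparsing import Keyword, Regex, SkipTo, Suppress, Word, nums, oneOf

# Assumed name shape: one or more capitalised words, e.g. "Shonn Greene".
name = Regex(r"[A-Z][A-Za-z'-]+(?: [A-Z][A-Za-z'-]+)*")("name")
action = oneOf("rush pass punts")("action")
# Convert the matched digits to an int before naming the result.
distance = Word(nums).setParseAction(lambda t: int(t[0]))("distance")

play = (
    name
    + action
    + Suppress(SkipTo(Keyword("for")))  # skip direction / receiver text
    + Suppress(Keyword("for"))
    + distance
    + Suppress(oneOf("yard yards"))
)


def parse_play(text):
    # parseString ignores trailing text unless parseAll=True is passed.
    return play.parseString(text).asDict()
```

With this grammar, `parse_play("Shonn Greene rush up the middle for 5 yards to the NYJ21.")` should give a dictionary with the keys name, action and distance. Direction and receiver are skipped here for brevity and would need their own named elements.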
I imagine pyparsing would work pretty well, but rule-based systems are pretty brittle. So, if you go beyond football, you might run into some trouble.
I think a better solution for this case would be a part-of-speech tagger and a lexicon (read: dictionary) of player names, positions and other sports terminology. Dump it into your favorite machine learning tool, figure out good features, and I think it'd do pretty well.
NLTK is a good place to start for NLP. Unfortunately, the field isn't very developed, and there isn't a tool out there that's like bam, problem solved, easy cheesy.
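To make the lexicon idea a bit more concrete, here is a small sketch that labels tokens by dictionary lookup and turns the labels into counts that could be fed to a learning tool. It is plain lookup, not a real part-of-speech tagger, and the `PLAYERS` and `TERMS` lexicons, the label names, and both function names are invented for illustration; real lexicons would come from rosters and a glossary.

```python
import re
from collections import Counter

# Hypothetical lexicons for illustration only.
PLAYERS = {"Mark Sanchez", "Shonn Greene", "Mike Jenkins"}
TERMS = {
    "pass": "PLAY_TYPE", "rush": "PLAY_TYPE", "punts": "PLAY_TYPE",
    "left": "DIRECTION", "right": "DIRECTION", "middle": "DIRECTION",
    "tackled": "DEFENSE", "fumble": "TURNOVER",
}

# Try known player names first (longest first), then fall back to single words.
TOKEN = re.compile(
    "|".join(re.escape(p) for p in sorted(PLAYERS, key=len, reverse=True)) + r"|\w+"
)


def tag(description):
    """Return (token, label) pairs; "O" marks tokens outside every lexicon."""
    tagged = []
    for token in TOKEN.findall(description):
        if token in PLAYERS:
            label = "PLAYER"
        elif token.isdigit():
            label = "NUMBER"
        else:
            label = TERMS.get(token.lower(), "O")
        tagged.append((token, label))
    return tagged


def features(description):
    """Count labels per description, e.g. as input features for a classifier."""
    return Counter(label for _, label in tag(description))
```

Label counts are about the simplest feature set possible; label sequences or neighbouring-token features would carry more information, at the cost of needing labelled training data.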