Using Python regex to extract an address from a Craigslist RSS feed
I'm pulling my hair out trying to parse a Craigslist RSS feed to extract location information.
I used feedparser to parse the feed into entries and entry descriptions. Unfortunately, the address information is contained in irregular tags within the description section.
The addresses are contained in a section that looks like this:
<!-- CLTAG xstreet0=11832 se 318pl -->
<!-- CLTAG xstreet1= -->
<!-- CLTAG city=auburn -->
<!-- CLTAG region=wa -->
11832 se 318pl
Feedparser doesn't like those CLTAGS. My attempt to capture the first line with regex looked like this:
addressStart = r'!-- CLTAG xstreet0='
addressEnd = r'-->'
prog = re.compile(addressStart(.*?)addressEnd)
result = prog.match(string)
...But that didn't work. What am I doing wrong? Here is a link to the RSS feed I'm working with: http://seattle.craigslist.org/see/apa/index.rss
Any help is greatly appreciated!
Comments (2)
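That's invalid syntax. You cannot concatenate or format strings unless the literal parts are quoted. Here the capture group (.*?) is written as bare code between two variable names, so Python cannot parse the line. Quote the group and join the pieces with +, for example re.compile(addressStart + '(.*?)' + addressEnd).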
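A minimal sketch of the quoted-and-concatenated pattern, reusing the variable names from the question. The sample string is copied from the CLTAG lines quoted above. search() is used here rather than match() only so that the sketch produces a result with the original addressStart, which does not include the leading '<'.

```python
import re

# Sample line copied from the CLTAG section quoted in the question.
line = "<!-- CLTAG xstreet0=11832 se 318pl -->"

addressStart = r'!-- CLTAG xstreet0='
addressEnd = r'-->'

# The capture group must be a quoted string joined with +.
prog = re.compile(addressStart + r'(.*?)' + addressEnd)

# search() scans the whole string, so the missing '<' does not matter here.
result = prog.search(line)

# group(1) is the text between the two markers; strip() drops the padding space.
street = result.group(1).strip() if result else None
print(street)
```

If the start or end marker ever contained regex metacharacters, wrapping each one in re.escape() before concatenating would keep them literal.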

Try search() instead of match(). The reason is that the line starts with a '<', but you defined addressStart to begin with the '!'. search() finds a match anywhere in the string, while match() only finds matches at the beginning. Alternatively, you could redefine addressStart to contain the leading '<'.
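The difference can be sketched as follows, again using the sample line from the question. The names match_result, search_result, and prog_with_bracket are illustrative and do not come from the original post.

```python
import re

# Sample line copied from the CLTAG section quoted in the question.
line = "<!-- CLTAG xstreet0=11832 se 318pl -->"

# Pattern as the question defines it: no leading '<'.
prog = re.compile(r'!-- CLTAG xstreet0=' + r'(.*?)' + r'-->')

# match() only tries position 0, where the line has '<', so it fails.
match_result = prog.match(line)

# search() tries every position and finds the pattern at index 1.
search_result = prog.search(line)
street = search_result.group(1).strip()

# Alternative: include the leading '<' so that match() succeeds on this line.
prog_with_bracket = re.compile(r'<!-- CLTAG xstreet0=(.*?)-->')
alt_street = prog_with_bracket.match(line).group(1).strip()

print(match_result, street, alt_street)
```

When the comment is not the very first thing in the entry description, match() fails even with the corrected prefix, so search() is the safer choice for a full description string.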