Extracting addresses from a Craigslist RSS feed with Python regular expressions

Posted on 2024-12-03 12:13:58

I'm pulling my hair out trying to parse a Craigslist RSS feed to extract location information.

I used feedparser to parse the feed into entries and entry descriptions. Unfortunately, the address information is contained in irregular tags within the description section.

The addresses are contained in a section that looks like this:

<!-- CLTAG xstreet0=11832 se 318pl  -->
<!-- CLTAG xstreet1= -->
<!-- CLTAG city=auburn -->
<!-- CLTAG region=wa -->
11832 se 318pl

Feedparser doesn't like those CLTAGS. My attempt to capture the first line with regex looked like this:

addressStart = r'!-- CLTAG xstreet0='
addressEnd = r'-->'

prog = re.compile(addressStart(.*?)addressEnd)
result = prog.match(string)

...But that didn't work. What am I doing wrong? Here is a link to the RSS feed I'm working with: http://seattle.craigslist.org/see/apa/index.rss

Any help is greatly appreciated!

瀟灑尐姊 2024-12-10 12:13:59

That is invalid syntax: addressStart(.*?)addressEnd is not a string expression, so Python cannot parse it. The capture group has to be a quoted string, joined to the other pieces with +. Try:

addressStart = r'!-- CLTAG xstreet0='
addressEnd = r'-->'

prog = re.compile(addressStart + r'(.*?)' + addressEnd)
# match() only matches at the start of the string, and the line begins with '<', so use search()
result = prog.search(string)
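
As a rough, self-contained sketch of the same concatenation idea, the snippet below builds the pattern from quoted pieces and runs it on the sample line from the question. The names address_start, address_end, pattern, and sample are illustrative, and wrapping the delimiters in re.escape is an added precaution rather than something the original answer requires.

```python
import re

# Literal delimiters taken from the question; re.escape guards any
# characters that could otherwise be read as regex metacharacters.
address_start = re.escape("<!-- CLTAG xstreet0=")
address_end = re.escape("-->")

# Concatenate quoted pieces; the capture group is itself a string.
pattern = re.compile(address_start + r"(.*?)" + address_end)

sample = "<!-- CLTAG xstreet0=11832 se 318pl  -->"  # sample line from the question
found = pattern.search(sample)

# Strip the trailing spaces that sit between the address and '-->'.
street = found.group(1).strip() if found else None
print(street)
```

Because this version of the pattern includes the leading <, pattern.match(sample) would also succeed on this one-line sample; search is still the safer choice when the tag sits in the middle of a longer description.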

画中仙 2024-12-10 12:13:59

Try search instead of match. The reason is that the line starts with a < but you defined addressStart to begin with the !. search finds a match anywhere in the string, while match only matches at the beginning of the string. Alternatively, you could redefine addressStart to include the leading <.

>>> import re
>>> addressStart = r'!-- CLTAG xstreet0='
>>> addressEnd = r'-->'
>>> prog = re.compile(addressStart + "(.*?)" + addressEnd)
>>> string = "<!-- CLTAG xstreet0=11832 se 318pl  -->"
>>> result = prog.search(string)
>>> result
<_sre.SRE_Match object at 0x1004806c0>
>>> result.group(1)
'11832 se 318pl  '
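
If the goal is the full address rather than only xstreet0, one possible extension is a single pattern that captures every CLTAG key and value at once. This is only a sketch: the description string is a hypothetical stand-in modeled on the snippet in the question, and CLTAG and parse_cltags are names made up for illustration, not part of feedparser or Craigslist.

```python
import re

# Hypothetical description text, modeled on the snippet in the question.
description = """<!-- CLTAG xstreet0=11832 se 318pl  -->
<!-- CLTAG xstreet1= -->
<!-- CLTAG city=auburn -->
<!-- CLTAG region=wa -->
11832 se 318pl"""

# One pattern for every tag: capture the key before '=' and the value
# up to the closing '-->', leaving surrounding whitespace out of the value.
CLTAG = re.compile(r"<!--\s*CLTAG\s+(\w+)=(.*?)\s*-->")


def parse_cltags(text):
    """Return a dict of CLTAG key/value pairs found anywhere in text."""
    return {key: value for key, value in CLTAG.findall(text)}


tags = parse_cltags(description)
print(tags)
```

findall scans the whole string in the same way search does, so the tags do not need to be at the start of the description. Empty values such as xstreet1 come back as empty strings, which the caller can filter out before assembling an address.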
