How do I sift through various 'a' tags when scraping a website?
I'm trying to scrape athletic.net, a site that stores track and field times, to get, for a given athlete, a list of each season, each event they ran, and every time they recorded in each event.
So far I have printed the season titles and the name of each event. I'm now trying to sift through a sea of <a> tags to find the times. I've tried using find_next('a') and find_next_sibling('a'), but I am struggling to isolate the times.
for text in soup.find_all('h5'):
    #print season titles and event name neatly
    if "Season" in str(text):
        text_file.write(('\n' + '\n' + str(text.contents[0])) + '\n')
    else:
        text_file.write(str(text.contents[0]) + '\n')
        #print all siblings
        for i in range(0,100):
            try:
                text = text.find_next_sibling()
                text_file.write(str(text) + '\n')
            except:
                print("miss")
So far all I can do is print all of the siblings, which contain all of the times. For example:
<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(174827223)" 
ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance & Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>
This output has all of the times for one event for this athlete in their most recent season.
How can I isolate only the times when there are various <a> tags that don't contain times?
If I use find_next_sibling('a'), it only prints None.
The question needs some improvement: it should be more focused and state the expected output, which is not quite clear.
You could use CSS selectors to get all the <a> tags that hold a time, or a more specific selector.
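As a rough sketch of the selector idea: in the markup shown in the question, only the links that hold a time have an href starting with /result/, so an attribute selector can pick them out. The html string below is a trimmed copy of that table, and the variable names are made up for illustration; whether the /result/ prefix holds across the whole site is an assumption.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (two rows only).
html = """
<table class="table table-sm table-responsive table-hover"><tbody>
<tr id="rID_162222827"><td><a href="" ng-click="appC.edit(162222827)"></a><i>9 </i></td>
<td style="width:110px;"><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td>
<td class="text-nowrap">Mar 4</td>
<td><a href="meet/443782#53587">Sunset Invitational</a></td></tr>
<tr id="rID_164098252"><td><a href="" ng-click="appC.edit(164098252)"></a><i>60 </i></td>
<td style="width:110px;"><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small>PR</small></a></td>
<td class="text-nowrap">Mar 19</td>
<td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# All links whose href starts with "/result/" (assumed to be the time links).
times = [a.get_text(strip=True) for a in soup.select('a[href^="/result/"]')]

# More specific: the same links, but only inside a table cell of a table row.
times_in_rows = [
    a.get_text(strip=True)
    for a in soup.select('table tr td a[href^="/result/"]')
]

print(times)  # ['2:10.97', '2:05.56']
```

The edit links (empty href), the PR link (/post/...) and the meet links (meet/...) do not match the prefix, so they are skipped.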
Example
Output
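A minimal sketch of how this might be tied back to the headings, assuming each event's h5 is followed by a sibling table as in the output posted in the question. The season and event names in the html string are hypothetical placeholders, and the results dictionary is only one possible way to collect the data. Note that find_next_sibling('a') returns None because the <a> tags are nested inside the sibling table rather than being siblings of the h5 themselves.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: heading texts are placeholders, the rows are
# trimmed from the table in the question.
html = """
<div>
<h5>2022 Outdoor Season</h5>
<h5>800 Meters</h5>
<table><tbody>
<tr><td><a href="" ng-click="appC.edit(162222827)"></a><i>9 </i></td>
<td><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td>
<td>Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td></tr>
<tr><td><a href="" ng-click="appC.edit(164098252)"></a><i>60 </i></td>
<td><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a></td>
<td>Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td></tr>
</tbody></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The links are descendants of the table, not siblings of the heading.
no_sibling_link = soup.find("h5").find_next_sibling("a")

results = {}
for h5 in soup.find_all("h5"):
    title = h5.get_text(strip=True)
    if "Season" in title:
        continue  # season heading: no table of its own in this sketch
    table = h5.find_next_sibling("table")
    if table is None:
        continue  # event heading without a results table
    # Search inside the table for the links assumed to hold times.
    results[title] = [
        a.get_text(strip=True) for a in table.select('a[href^="/result/"]')
    ]

print(no_sibling_link)  # None
print(results)          # {'800 Meters': ['2:10.97', '2:05.56']}
```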