Trouble finding links in a very large string

Posted 2025-02-04 12:33:57 · 1,877 characters · 1 view · 0 comments


I am scraping Baseball Reference for a data science project, and have come across an issue when trying to scrape player data from a specific league, one that just started playing this season. When I scrape old leagues that have already finished playing I have no issues, but I want to scrape the league at this link: https://www.baseball-reference.com/register/league.cgi?id=c346199a live as the season goes on. However, the links are hidden behind a lot of what seems to be plain text, so BeautifulSoup.find_all('a', href=True) does not work.
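To see why `find_all('a', href=True)` comes up empty here, a minimal, self-contained illustration (the HTML snippet below is hypothetical, mimicking how the site wraps its tables in HTML comments):

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: the team table is wrapped in an HTML comment,
# so its <a> tags are comment text, not real elements in the parse tree.
html = """
<div id="content">
  <a href="/register/league.cgi?id=c346199a">League</a>
  <!--
  <div class="table_container">
    <a href="/register/team.cgi?id=71fe19cd">Kenosha Kingfish</a>
  </div>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=True)

# Only the league link is found; the commented-out team link is invisible.
print([a["href"] for a in links])  # → ['/register/league.cgi?id=c346199a']
```

This is why the team links look like "plain text" in the page source: to BeautifulSoup they are the body of a comment node, not anchor tags.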

So instead here is what my thought process has been so far.

import requests
from bs4 import BeautifulSoup

html = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/league.cgi?id=c346199a').text, features = 'html.parser').find_all('div')
ind = [str(div) for div in html][0]
orig_ind = ind[ind.find('/register/team.cgi?id='):]
count = orig_ind.count('/register/team.cgi?id=')

team_links = []
for i in range(count):
  # rn finds the same one over and over
  link = orig_ind[orig_ind.find('/register/team.cgi?id='):orig_ind.find('title')].strip().replace('"', '')
  # try to remove it from orig_ind and do the next link...
  # this is the part that is not working rn
  orig_ind = orig_ind.replace(link, '')
  team_links.append('https://baseball-reference.com' + link)

Which outputs:

['https://baseball-reference.com/register/team.cgi?id=71fe19cd',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',
 'https://baseball-reference.com',

and so on. I am trying to get all of the team links from this page: https://www.baseball-reference.com/register/league.cgi?id=c346199a

and then crawl over to the player links on each of those pages and collect some data. Like I said, it works on pretty much every league I have tried except this one.
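As an aside, the loop finds the same link over and over because `link` has its quote stripped before the `replace`, so the `replace` never matches anything in `orig_ind` and `find` keeps returning the first occurrence. Scanning with an explicit offset fixes that part (a sketch on a small hypothetical HTML string; note it still won't surface links hidden inside HTML comments):

```python
# A sketch of the substring-scanning approach with an explicit offset.
# str.find(sub, start) resumes the search after the previous hit instead
# of re-finding the same one. The HTML string here is hypothetical.
html = (
    '<a href="/register/team.cgi?id=71fe19cd" title="Kenosha Kingfish">'
    '<a href="/register/team.cgi?id=8f1a41fc" title="Kokomo Jackrabbits">'
)

marker = '/register/team.cgi?id='
team_links = []
pos = 0
while True:
    start = html.find(marker, pos)
    if start == -1:
        break
    end = html.find('"', start)          # the href value ends at the closing quote
    team_links.append('https://baseball-reference.com' + html[start:end])
    pos = end                            # resume scanning past this match

print(team_links)
```

That said, string surgery is fragile; the accepted fix below works on the parse tree instead.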

Any help is greatly appreciated.


Comments (1)

空城之時有危險 2025-02-11 12:33:57


The tables you see on this site are stored inside HTML comments (<!-- ... -->), so BeautifulSoup normally doesn't see them. To parse them, try the next example:

import requests
from bs4 import BeautifulSoup, Comment


soup = BeautifulSoup(
    requests.get(
        "https://www.baseball-reference.com/register/league.cgi?id=c346199a"
    ).text,
    features="html.parser",
)

s = "".join(c for c in soup.find_all(text=Comment) if "table_container" in c)
soup = BeautifulSoup(s, "html.parser")

for a in soup.select('[href*="/register/team.cgi?id="]'):
    print("{:<30} {}".format(a.text, a["href"]))

Prints:

Battle Creek Bombers           /register/team.cgi?id=f3c4b615
Kenosha Kingfish               /register/team.cgi?id=71fe19cd
Kokomo Jackrabbits             /register/team.cgi?id=8f1a41fc
Rockford Rivets                /register/team.cgi?id=9f4fe2ef
Traverse City Pit Spitters     /register/team.cgi?id=7bc8d111
Kalamazoo Growlers             /register/team.cgi?id=9995d2a1
Fond du Lac Dock Spiders       /register/team.cgi?id=02911efc

...and so on.
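The comment-extraction step above can be checked offline on a small hypothetical snippet (same idea as the answer, with the comment filter written as an explicit `isinstance` check, which is the form the BeautifulSoup docs use):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical fragment mirroring the real pages: the table lives in a comment.
html = """
<div id="content">
  <!--
  <div class="table_container">
    <a href="/register/team.cgi?id=71fe19cd">Kenosha Kingfish</a>
    <a href="/register/team.cgi?id=8f1a41fc">Kokomo Jackrabbits</a>
  </div>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every comment that contains a table container,
# then re-parse that text as HTML so the anchors become real tags.
commented = "".join(
    c for c in soup.find_all(string=lambda t: isinstance(t, Comment))
    if "table_container" in c
)
inner = BeautifulSoup(commented, "html.parser")

team_links = [a["href"] for a in inner.select('[href*="/register/team.cgi?id="]')]
print(team_links)
# → ['/register/team.cgi?id=71fe19cd', '/register/team.cgi?id=8f1a41fc']
```

Once the comment body is re-parsed, all the usual tag-based tools (`find_all`, `select`) work on it, so the rest of the crawl can proceed as with any other league page.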