URL and HTML Inspect give different results

Posted on 2025-02-06 15:49:08

When I copy the URL of a Facebook page and create a BeautifulSoup object from it, the text I get back is not the posts shown on the page. Namely,

import requests
from bs4 import BeautifulSoup

text = requests.get('https://www.facebook.com/toyota').text
soup = BeautifulSoup(text, 'lxml')
soup.get_text()

returns '\n\n\n\n\n\n\n\n\n\n\n\nToyota USA - Ana Sayfa\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'.

However, when I inspect that Facebook page in the browser, copy the HTML element, and follow similar steps, I get what I want. So

# Placeholder: paste the element copied from the browser's inspector here
html_inspected = """<copied HTML element>"""
soup = BeautifulSoup(html_inspected, 'lxml')
soup.get_text()

returns the actual text of the Facebook page. My question is: do I have to inspect the page and copy the HTML every time I want its actual content? Is there a shortcut for getting the posts and comments on a Facebook page without inspecting it each time?

Comments (1)

云淡风轻 2025-02-13 15:49:08

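As pointed out by @HedgeHog, this is most likely a JavaScript issue: Facebook builds most of the page content in the browser after the initial HTML has loaded, so requests.get only receives a nearly empty shell, while the browser inspector shows the page after the scripts have run.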
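To illustrate the point, here is a minimal sketch of why a static parser sees so little. It uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free, and the STATIC_HTML page, the VisibleText class, and the visible_text helper are all made up for this example. The post text only exists inside a script that a browser would run, so a parser that never executes JavaScript finds nothing but the title.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a JavaScript-rendered page: the post text
# only exists inside a script that a browser would execute.
STATIC_HTML = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <div id="feed"></div>
    <script>
      document.getElementById("feed").textContent = "Hello from a post";
    </script>
  </body>
</html>
"""


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text chunks of an HTML string, without running JS."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(STATIC_HTML))  # ['Example Page']
```

The feed div stays empty because nothing executes the script, which mirrors the mostly blank output in the question.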

The simplest solution would be to use a ready-made scraper library for the task:

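# Third-party package: pip install facebook-scraper
from facebook_scraper import get_posts

# Fetch the first page of posts and print the first 50 characters of each
for post in get_posts('toyota', pages=1):
    print(post['text'][:50])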
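If some posts carry no text (an image-only post, for example), slicing post['text'] directly could raise an error. Whether the library actually returns None or leaves the key out in such cases is an assumption here, not something the answer states. A small defensive helper, with the hypothetical name preview and made-up sample dictionaries, might look like this:

```python
def preview(post, width=50):
    """Return up to `width` characters of a post's text, or '' if absent."""
    # Assumption: 'text' may be missing or None for some posts.
    text = post.get("text") or ""
    return text[:width]


# Hypothetical sample data shaped like dictionaries with a 'text' key.
sample_posts = [
    {"text": "A new model arrives this spring."},
    {"text": None},
    {},
]

for post in sample_posts:
    print(repr(preview(post, width=10)))
```

In the loop above, print(preview(post)) would then replace print(post['text'][:50]).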

Alternatively, you could use Selenium:

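from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By

# The context manager closes the browser when the block exits
with Firefox() as driver:
    driver.get('https://www.facebook.com/toyota')
    # Select every <div dir="auto"> element on the rendered page
    elem = driver.find_elements(By.XPATH, '//div[@dir="auto"]')

    # Read the text while the browser session is still open
    for item in elem:
        print(item.text)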
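A broad XPath such as //div[@dir="auto"] may well match nested or empty elements, so the printed output can contain blank lines and repeated text. That is a guess about the page structure, not something verified here. A minimal clean-up sketch, using a hypothetical helper named unique_nonempty and made-up sample strings in place of real item.text values:

```python
def unique_nonempty(texts):
    """Strip each string, drop blanks, and keep the first copy of duplicates."""
    seen = set()
    result = []
    for text in texts:
        cleaned = text.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Hypothetical strings standing in for `item.text` values.
raw = ["Post one", "", "Post one", "  Comment A ", "Comment A"]
print(unique_nonempty(raw))  # ['Post one', 'Comment A']
```

Inside the with block above, this could be called as unique_nonempty(item.text for item in elem).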