URL and HTML inspect give different results

Posted on 2025-02-06 15:49:08

When I copy the URL of a Facebook page and create a BeautifulSoup object from it, the text I get back is not the posts on the page. Namely,

import requests
from bs4 import BeautifulSoup

text = requests.get('https://www.facebook.com/toyota').text
soup = BeautifulSoup(text, 'lxml')
soup.get_text()

returns '\n\n\n\n\n\n\n\n\n\n\n\nToyota USA - Ana Sayfa\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'.

However, when I inspect that Facebook page in the browser, copy the HTML element, and follow similar steps, I get what I want. So

# Placeholder: paste the HTML element copied from the browser inspector here
html_inspected = "..."
soup = BeautifulSoup(html_inspected, 'lxml')
soup.get_text()

returns the actual text on the Facebook page that I want. My question is: am I supposed to inspect the page and copy the HTML every time I want to get its actual content? Is there a shortcut for getting the posts and comments on a Facebook page without inspecting it every time?

云淡风轻 2025-02-13 15:49:08

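As pointed out by @HedgeHog, this is most likely a JavaScript issue: requests downloads only the initial HTML, while the posts are rendered afterwards in the browser by JavaScript, which is why the HTML copied from the inspector contains them.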
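To illustrate the point, here is a minimal sketch that uses only Python's standard library (`html.parser` is swapped in for BeautifulSoup so the example runs without third-party packages). The HTML string is a hypothetical stand-in, not Facebook's actual markup, and `VisibleText` and `visible_text` are names made up for this example. It shows that when a script is what fills in the content, the static source contains almost no visible text:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what an HTTP client receives: a title, an empty
# container, and a script that would fill the container in a real browser.
STATIC_HTML = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <div id="feed"></div>
    <script>
      document.getElementById("feed").textContent = "First post";
    </script>
  </body>
</html>
"""


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-empty visible text chunks found in `html`."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The script never runs here, so "First post" is not part of the visible text.
print(visible_text(STATIC_HTML))  # → ['Example Page']
```

A browser would execute the script and show "First post", which matches the difference between the downloaded source and the HTML copied from the inspector.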

The simplest solution would be to use a ready-made scraper library for the task:

from facebook_scraper import get_posts

for post in get_posts('toyota', pages=1):
    print(post['text'][:50])
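One thing to watch in the snippet above: if a post carries no text (for example, an image-only post), slicing `post['text']` may fail. Below is a minimal sketch of a defensive preview helper, assuming each post is a plain dict whose `text` key may be missing or `None`. `preview` and the sample dicts are hypothetical and are not part of facebook_scraper:

```python
def preview(post, width=50):
    """Return the first `width` characters of a post's text, or '' if absent."""
    # Assumption: `post` is a dict; "text" may be missing or set to None.
    text = post.get("text") or ""
    return text[:width]


# Hypothetical sample data standing in for scraped posts.
sample_posts = [
    {"text": "A" * 80},
    {"text": None},
    {},
]

for post in sample_posts:
    print(repr(preview(post)))
```

In the loop over `get_posts(...)`, `print(preview(post))` could then take the place of `print(post['text'][:50])`.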

Alternatively, you could use Selenium:

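from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By

# The context manager closes the browser when the block exits
with Firefox() as driver:
    driver.get('https://www.facebook.com/toyota')
    # Collect every <div dir="auto"> element on the rendered page
    elements = driver.find_elements(By.XPATH, '//div[@dir="auto"]')

    for item in elements:
        print(item.text)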
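Because the posts are loaded asynchronously, the elements may not exist yet at the moment `find_elements` is called. The sketch below adds an explicit wait and removes blank or repeated strings, assuming that the broad `//div[@dir="auto"]` XPath can match nested elements that repeat the same text. `unique_nonempty` and `fetch_page_texts` are hypothetical helper names, and the 10-second timeout is an arbitrary choice:

```python
def unique_nonempty(texts):
    """Drop blank strings and repeats while keeping first-seen order."""
    seen = set()
    result = []
    for text in texts:
        cleaned = text.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def fetch_page_texts(url, timeout=10):
    """Open `url` in Firefox, wait for matching elements, return their text."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium.webdriver import Firefox
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, '//div[@dir="auto"]')
    with Firefox() as driver:
        driver.get(url)
        # Block until at least one matching element is present;
        # a TimeoutException is raised if none appears within `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(locator)
        )
        return unique_nonempty(element.text for element in elements)
```

Calling `fetch_page_texts('https://www.facebook.com/toyota')` requires Selenium and a local Firefox installation. The pure-Python `unique_nonempty` helper can be tried on its own with any list of strings.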
