I need to scrape emails from a website, but there are no class names or anything similar

Posted on 2025-01-30 06:03:24

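I have a problem scraping a site. There are 3700 pages, each with a person's email address, and I need to collect them. The problem is that the elements have no class names, and the XPath can differ from page to page because sometimes a phone number appears before the email, which breaks everything. I have tried several approaches with Selenium, but none of them worked. Can you give me some advice on how to deal with this and how to scrape the addresses? Below are examples of pages with different HTML structures. Thanks!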

<div>
   <div><i class="fa fa-envelope" style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span></div>
   <div><a href="http://JeanAbbott.com" target="_blank" class="websiteLink" rel="noopener noreferrer" style="overflow-wrap: normal; text-overflow: ellipsis; overflow: hidden;">JeanAbbott.com</a></div>
   <div id="contactInfoWrap" style="margin-top: 10px;">
      <div>Jean Abbott</div>
      <div>
         <div>5 Colonial Circle</div>
         <div>Medicine Lake, MN 55441</div>
         <div>US</div>
      </div>
   </div>
</div>

And another one:

<div>
   <div><i class="fa fa-phone" style="margin-right: 0.5rem;"></i>202-800-7057</div>
   <div><i class="fa fa-envelope" style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
   <div><a href="http://edlinguist.com/" target="_blank" class="websiteLink" rel="noopener noreferrer" style="overflow-wrap: normal; text-overflow: ellipsis; overflow: hidden;">edlinguist.com/</a></div>
   <div id="contactInfoWrap" style="margin-top: 10px;">
      <div>LaNysha Adams</div>
      <div>
         <div>80 M St SE</div>
         <div>1st Floor</div>
         <div>Washington, DC 20003</div>
         <div>US</div>
      </div>
   </div>
</div>

The element that I need looks like this:

<span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span>

Comments (2)

夏了南城 2025-02-06 06:03:24
//div[contains(.,"@")]/span

The XPath expression above selects the HTML element you want:

<span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span>

The desired text node value is: moc.tsiugnilde@ahsynal
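
To see why this expression does not care whether a phone line comes first, here is a minimal sketch that uses only Python's standard library. `xml.etree.ElementTree` does not support `contains()`, so the helper `spans_in_divs_containing_at` (a name made up for this example) approximates the XPath by hand. It also assumes the markup parses as XML, which holds for the trimmed samples below but may not hold for the real pages; with a full HTML parser the XPath could be used directly.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two sample layouts from the question.
PAGE_WITHOUT_PHONE = """
<div>
   <div><i class="fa fa-envelope"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span></div>
   <div><a href="http://JeanAbbott.com">JeanAbbott.com</a></div>
</div>
"""

PAGE_WITH_PHONE = """
<div>
   <div><i class="fa fa-phone"></i>202-800-7057</div>
   <div><i class="fa fa-envelope"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
   <div><a href="http://edlinguist.com/">edlinguist.com/</a></div>
</div>
"""


def spans_in_divs_containing_at(markup):
    """Approximate //div[contains(., "@")]/span (hypothetical helper)."""
    root = ET.fromstring(markup)
    found = []
    for div in root.iter("div"):
        # "".join(itertext()) plays the role of the XPath string value ".".
        if "@" in "".join(div.itertext()):
            # findall("span") returns direct children only, like "/span".
            found.extend(div.findall("span"))
    return found


texts_without_phone = [s.text for s in spans_in_divs_containing_at(PAGE_WITHOUT_PHONE)]
texts_with_phone = [s.text for s in spans_in_divs_containing_at(PAGE_WITH_PHONE)]
print(texts_without_phone)
print(texts_with_phone)
```

Both layouts yield exactly one span, because the match depends on the "@" in the text and not on the position of the div. Note that the text comes back with a leading space and still in reversed order.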

流心雨 2025-02-06 06:03:24

It seems the email addresses are mirrored, that is, stored in reverse. To compensate, the span carries the style unicode-bidi: bidi-override; direction: rtl; which makes the browser render the text right to left, so moc.tsiugnilde@ahsynal is displayed as lanysha@edlinguist.com.

So it may be better to just use this XPath (note the @ in front of the attribute name):

//span[@style='unicode-bidi: bidi-override; direction: rtl;']
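
As a rough sketch of the whole idea, the snippet below selects the span by its style attribute (written `@style` in XPath) and then reverses the text to recover the readable address. It uses only the standard library and assumes the markup parses as XML and that the style string is identical on every page; `extract_emails`, `RTL_STYLE` and `PAGES` are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: every page uses exactly this style string on the email span.
RTL_STYLE = "unicode-bidi: bidi-override; direction: rtl;"

# Trimmed versions of the two sample pages from the question.
PAGES = [
    """
    <div>
       <div><i class="fa fa-envelope"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span></div>
       <div><a href="http://JeanAbbott.com">JeanAbbott.com</a></div>
    </div>
    """,
    """
    <div>
       <div><i class="fa fa-phone"></i>202-800-7057</div>
       <div><i class="fa fa-envelope"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
       <div><a href="http://edlinguist.com/">edlinguist.com/</a></div>
    </div>
    """,
]


def extract_emails(markup):
    """Return the de-mirrored text of every span carrying the RTL style."""
    root = ET.fromstring(markup)
    spans = root.findall(f".//span[@style='{RTL_STYLE}']")
    # Strip the leading space, then reverse the characters with [::-1].
    return [(span.text or "").strip()[::-1] for span in spans]


emails = [email for page in PAGES for email in extract_emails(page)]
print(emails)
```

The same XPath string can be passed to Selenium's `find_elements(By.XPATH, ...)`. Whether the text you get back there is already reversed is something to check on a real page before applying `[::-1]`.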