Problem with Eastern European characters when scraping data from the European Parliament website
EDIT: Thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is great motivation to keep learning Python!
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accents at the end of the family name):
<td class="listcontentlight_left">
<a href="/members/expert/alphaOrder/view.do?language=EN&id=28276" title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>
So far I have been using pyparsing and the following code:
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for name in names.searchString(page):
    print(name)
However, this does not catch the name from the HTML above. Any advice on how to proceed?
Best, Thomas
P.S.: Here is all the code I have so far:
# -*- coding: utf-8 -*-
import urllib.request
from pyparsing_py3 import *

page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")

#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for name in names.searchString(page):
    print(name)
5 Answers
I was able to show 31 names starting with 'A' with my code. As John noticed, you need more Unicode characters (extended_chars), and some names contain hyphens etc. (special_chars). Count how many names you received and check whether the page gives you the same count as I got for 'A'. The range 0x80-0x7FF covers the 2-byte UTF-8 sequences of probably all European languages. Among the pyparsing examples there is greetingInGreek.py for Greek, and another example for parsing Korean text. If 2 bytes are not enough, extend the range further.
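A sketch of that kind of parser (the exact ranges and the special_chars set here are assumptions reconstructed from the description above, not the answerer's original code):

# -*- coding: utf-8 -*-
import urllib.request
from pyparsing import Word, ZeroOrMore, Suppress, alphanums, alphas8bit, srange

page = urllib.request.urlopen(
    "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=A&language=EN"
).read().decode("utf8")

extended_chars = srange(r"[\0x100-\0x7ff]")  # 2-byte UTF-8 range beyond Latin-1
special_chars = " -'"  # assumed: hyphens and apostrophes occur inside names

name = Word(alphanums + alphas8bit + extended_chars + special_chars)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for tokens in names.searchString(page):
    print(tokens)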
Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser such as Beautiful Soup, which lets you specify the location you're interested in via the DOM, so pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:
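A sketch with BeautifulSoup 4 (the class name comes from the question's HTML; page is the decoded HTML from the question's code):

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "html.parser")
for cell in soup.find_all("td", class_="listcontentlight_left"):
    link = cell.find("a")       # first <a> inside the cell
    if link is not None:
        print(link.get_text())  # e.g. "ANDRIKIENĖ, Laima Liucija"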
Looks like you've got some kind of encoding problem if you are getting Western European names OK (they have lots of accents etc. also!). Show us all of your code plus the URL of a typical page that you are trying to scrape that has the East-only problem. Displaying the piece of HTML that you have is not much use; we have no idea what transformations it has been through; at the very least, show the result of the repr() function.
Update: The offending character in that MEP's name is U+0116 (LATIN CAPITAL LETTER E WITH DOT ABOVE). So it is not included in pyparsing's "alphanums + alphas8bit". The Westies (Latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL Unicode alphabetics, not just Latin-n, in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
Other observations:
(1) alphaNUMs ... digits in a name?
(2) names may include apostrophe and hyphen e.g. O'Reilly, Foughbarre-Smith
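A quick check of the offending character (a sketch; testing membership in the question's character class this way is an assumption about how pyparsing builds its Word sets):

import unicodedata
from pyparsing import alphanums, alphas8bit

ch = "\u0116"
print(repr(ch), unicodedata.name(ch))  # 'Ė' LATIN CAPITAL LETTER E WITH DOT ABOVE
print(ch in (alphanums + alphas8bit))  # False: the question's Word cannot match it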
At first I thought I'd recommend trying to build a custom letter class from Python's unicodedata.category method which, given a character, tells you which class that codepoint is assigned to according to the Unicode character categories; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit, or something else. On second thought, and reminiscent of an answer I gave the other day, let me suggest another approach. There are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that 'a character equals a byte', and another is that 'a person's name is made up of letters, and I know what the possible letters are'. Unicode is vast, and the EU currently has 23 official languages written in three alphabets; figuring out exactly which characters are used for each language would involve quite a bit of work. Greek uses those fancy apostrophes and is distributed across at least 367 codepoints; Bulgarian uses the Cyrillic alphabet with a slew of extra characters unique to the language.
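That first idea would look roughly like this (a sketch; the helper name is mine):

import unicodedata

def is_letter(ch):
    # category() returns e.g. 'Lu' (uppercase letter) or 'Ll' (lowercase letter)
    return unicodedata.category(ch).startswith("L")

print(is_letter("Ė"), is_letter("7"), is_letter("-"))  # True False False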
So why not simply turn the tables and take advantage of the larger context those names appear in? I browsed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname, with (1) the last name in (almost) upper case, (2) a comma and a space, and (3) the given names in ordinary case. This even holds in more 'deviant' examples like GERINGER de OEDENBERG, Lidia Joanna, GALLAGHER, Pat the Cope (wow), and McGUINNESS, Mairead. It would take some work to recover the ordinary case from the last names (maybe leave all the lower-case letters in place, and lower-case any capital letter that is preceded by another capital letter), but extracting the names is in fact simple: since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of the name, so you can just cut out that maximum extent and split it into two parts. As I see it, all you have to look for is the first occurrence of a comma followed by a space: everything before it is the last name, everything behind it the given names of the person. I call that the 'silhouette approach' since it's like looking at the negative, the outline, rather than the positive, what the form is made up from.
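A minimal sketch of the silhouette approach (the sample string is taken from the examples above):

# Split at the FIRST comma+space; everything before it is the last name
title = "GERINGER de OEDENBERG, Lidia Joanna"
last, _, given = title.partition(", ")
print(last)   # GERINGER de OEDENBERG
print(given)  # Lidia Joanna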
As has been noted earlier, some names use hyphens, and there are several codepoints in Unicode that look like hyphens. Let's hope the typists over there in Brussels were consistent in their usage. Ah, and many surnames use apostrophes, like d'Hondt, d'Alambert. Happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. Most of these codepoints would be 'wrong' to use in surnames, but when was the last time you saw them used correctly?
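For reference, those look-alike codepoints by their Unicode names (a small sketch):

import unicodedata

for cp in (0x0060, 0x00B4, 0x0027, 0x02BC):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
# U+0060 GRAVE ACCENT
# U+00B4 ACUTE ACCENT
# U+0027 APOSTROPHE
# U+02BC MODIFIER LETTER APOSTROPHE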
I somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least the alphanums part is a tad bogus, as it seems to include digits (which ones? Unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. Unicode conceptually works in a 32-bit space. What is '8bit' meant to mean? Letters found in codepage 852? C'mon, this is 2010.
Ah, and looking back, I see you seem to be parsing the HTML with pyparsing. Don't do that. Use e.g. Beautiful Soup for sorting out the markup; it's quite good at dealing even with faulty HTML (most HTML in the wild does not validate), and once you get your head around its admittedly wonderlandish API (all you ever need is probably the find() method), it will be simple to fish out exactly those snippets of text you're looking for.
Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute force reg exps). One function in particular is makeHTMLTags, which takes a single string argument (the base tag), and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening tag expression does far more than just return the equivalent of "<"+tag+">". It also:
- handles upper/lower casing of the tag itself
- handles embedded attributes (returning them as named results)
- handles attribute names that have namespaces
- handles attribute values in single, double, or no quotes
- handles empty tags, as indicated by a trailing '/' before the closing '>'
- can be filtered for specific attributes using the withAttribute parse action
So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag and then accessing the title attribute. Something like this:
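Here is a sketch of that idea (makeHTMLTags and scanString are pyparsing's own APIs; the loop details are an assumption):

from pyparsing import makeHTMLTags

# a_start matches <a ...> opening tags; attributes become named results
a_start, a_end = makeHTMLTags("a")

for tokens, start, stop in a_start.scanString(page):  # page: decoded HTML from the question
    if tokens.title:
        print(tokens.title)  # e.g. "ANDRIKIENĖ, Laima Liucija"

Now you get whatever is in the title attribute, regardless of character set.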