I'm using an open-source Perl script to create a text corpus from the English-language Wikipedia dump. The plain text has been extracted, but various punctuation marks and the like still need to be removed. However, the script's output is essentially a 7.2 GiB text file containing a single line. For my purposes, I want to alter the script so that it inserts a newline character every 20 words.
So far, I've tried this:
$wordCount=0;
while (<STDIN>) {
    $wordCount++;
    # text processing regex commands here
    # Remove ellipses
    s/\.\.\./ /g;
    # Remove dashes surrounded by spaces (e.g. phrase - phrase)
    s/\s-+\s/ /g;
    # Remove dashes between words with no spaces (e.g. word--word)
    s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
    # Remove dash at a word end (e.g. three- to five-year)
    s/(\w)-\s/$1 /g;
    # Remove some punctuation
    s/([\"\�,;:%�?�!()\[\]{}<>_\.])/ /g;
    # Remove trailing space
    s/ $//;
    # Remove double single-quotes
    s/'' / /g;
    s/ ''/ /g;
    # Replace accented e with normal e for consistency with the CMU pronunciation dictionary
    s/é/e/g;
    # Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
    s/\s'([\w\s]+[\w])'\s/ $1 /g;
    # Remove double spaces
    s/  / /g;
    chomp($_);
    if ($wordCount == 20){
        print uc($_) . "\n";
        $wordCount=0;
    }
    print uc($_) . " ";
}
print "\n";
However, this doesn't seem to work: the raw output only has newlines scattered around arbitrarily. I'd like the text formatted so that it fits on a typical 1200px-wide monitor without word wrapping.
A sample input text from the file is
The Concise Oxford Dictionary of Politics. Proponents of anarchism
(known as "anarchists") advocate stateless societies as the only moral
form of social organization. There are many types and traditions of
anarchism, not all of which are mutually exclusive. Anarchism as a
social movement has regularly endured fluctuations in popularity. The
term anarchism derives from the Greek ἄναρχος, anarchos, meaning
"without rulers", its use as a synonym is still common outside the
United States. The earliest anarchist themes can be found in the 6th
century BC, among the works of Taoist philosopher Laozi, and in later
centuries by Zhuangzi and Bao Jingyan. The term "anarchist" first
entered the English language in 1642, during the English Civil War, as
a term of abuse, used by Royalists against their Roundhead opponents.
By the time of the French Revolution some, such as the Enragés, began
to use the term positively, in opposition to Jacobin centralisation
of power, seeing "revolutionary government" as oxymoronic. By the
turn of the 19th century, the English word "anarchism" had lost its
initial negative connotation. Modern anarchism sprang from the secular
or religious thought of the Enlightenment, particularly Jean-Jacques
Rousseau's arguments for the moral centrality of freedom. Anarchism",
Encarta Online Encyclopedia 2006 (UK version). From this climate
William Godwin developed what many consider the first expression of
modern anarchist thought. Godwin was, according to Peter Kropotkin,
"the first to formulate the political and economical conceptions of
anarchism, even though he did not give that name to the idea s
developed in his work", while Godwin attached his anarchist ideas to
an early Edmund Burke. The anarcho-communist Joseph Déjacque was the
first person to describe himself as "libertarian". Unlike Proudhon, he
argued that, "it is not the product of his or her labor that the
worker has a right to, but to the satisfaction of his or her needs,
whatever may be t heir nature. Jesus is sometimes considered the first
anarchist in the Christian anarchist tradition. Georges Lechartier
wrote that "The true founder of anarchy was Jesus Christ and . In
Europe, harsh reaction followed the revolutions of 1848, during which
ten countries had experienced brief or long-term social upheaval as
groups carried out nationalis t uprisings. After most of these
attempts at systematic change ended in failure, conservative elements
took advantage of the divided groups of socialists, anarchists,
liberals, and na tionalists, to prevent further revolt. Blanquists,
Philadelphes, English trade unionists, socialists and social
democrats. Due to its links to active workers' movements, the
International became a significant organization. Karl Marx became a
leading figure in the International and a member of its General
Council. Proudhon's followers, the mutualists, opposed Marx's state
socialism, advocating political abstentionism and small property
holdings. In 1868, following their unsuccessful participation in the
League of Peace and Freedom (LPF), Russian revolutionary Mikhail
Bakunin and his collectivist anarchist associa tes joined the First
International (which had decided not to get involved with the LPF). At
first, the collectivists worked with the Marxists to push the First
International in a more revolutionary socialist direction.
Subsequently, the International became polarised into two camps, with
Marx and Bakunin as their respective figureheads. In 1872, the
conflict climaxed with a final split between the two groups at the
Hague Congress, where Bakunin and James Guillaume were expelled from
the International and its headquarters were transferred to New York.
In response, the federalist sections formed their own International at
the St. Imier Congress, adopting a revolutionary anarchist program.
Black Rose Books 2005) ISBN 1-55164-251-4.
There are 7-something gigabytes of text in the file, so using a list or other data structure might be a bit of overkill for these requirements.
What is needed in order to fit my requirements?
True to Perl, there are various ways to solve this, but one (perverse?!) way is to read the file byte by byte instead of reading it line by line or slurping the whole thing in. It's rather brute-force, but it works. Essentially you are trading memory use for disk usage.
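One way the byte-by-byte idea could look is sketched below. It is only an illustration, not the answerer's original code: the sub name `rewrap_bytewise` and its parameters are hypothetical, and it assumes words are separated by ASCII whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, starting a new line after every
# $per_line whitespace-separated words. Only one byte is held at a time.
sub rewrap_bytewise {
    my ($in, $out, $per_line) = @_;
    my $count   = 0;    # completed words on the current output line
    my $in_word = 0;    # true while we are inside a word
    while (defined(my $ch = getc $in)) {
        if ($ch =~ /[ \t\r\n\f]/) {
            if ($in_word) {             # a word has just ended
                $in_word = 0;
                $count++;
                if ($count == $per_line) {
                    print {$out} "\n";
                    $count = 0;
                }
            }
        }
        else {
            # Put a single space before every word except the first on a line.
            print {$out} ' ' if !$in_word && $count > 0;
            print {$out} $ch;
            $in_word = 1;
        }
    }
    # Terminate the last, possibly short, line.
    print {$out} "\n" if $in_word || $count > 0;
}

# Small demo on an in-memory string, three words per line.
my $sample = "the quick brown fox jumps over the lazy dog again";
open my $demo_in, '<', \$sample or die "cannot open in-memory input: $!";
rewrap_bytewise($demo_in, \*STDOUT, 3);
```

For the real file the call would be something like `rewrap_bytewise(\*STDIN, \*STDOUT, 20)`. Memory use stays constant, at the cost of one `getc` call per byte.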
Consider using something like Text::Wrap or Text::Autoformat.
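As a rough sketch of the Text::Wrap route: the module wraps at a column width rather than at a word count, so the width of 100 used in the demo is an assumption, not something given in the question. The wrapper sub `wrap_text` is hypothetical; `wrap`, `$Text::Wrap::columns` and `$Text::Wrap::unexpand` come from the module itself.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical wrapper: wrap $text so that every line is shorter than $width.
sub wrap_text {
    my ($text, $width) = @_;
    local $Text::Wrap::columns  = $width;   # lines end up at most $width - 1 long
    local $Text::Wrap::unexpand = 0;        # do not turn runs of spaces into tabs
    return wrap('', '', $text);             # no indent on first or later lines
}

# Demo: 100 columns is an assumed width for illustration.
my $sample = join ' ', ('anarchism') x 30;
print wrap_text($sample, 100), "\n";
```

Text::Wrap works on strings that are already in memory, so a single 7.2 GiB line would presumably have to be fed to it in smaller pieces rather than in one call.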
First, set Perl's input record separator ($/) to something frequent and useful, like a space.
Then loop over the input word by word.
Trim each word.
Skip it if it was all whitespace.
Do any other transforms you need.
Then add it to a stack, splitting on any internal tabs or other space-like characters.
Next, check whether you have 20 words, and if so, print them.
Finally, print anything remaining.
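The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, since the answer's original snippets are not present here; the sub name `rewrap_by_words` and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read space-terminated chunks from $in and print
# $per_line words per line to $out.
sub rewrap_by_words {
    my ($in, $out, $per_line) = @_;
    local $/ = ' ';                     # one record = text up to the next space
    my @words;                          # holds fewer than $per_line words between reads
    while (defined(my $chunk = <$in>)) {
        $chunk =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
        next unless length $chunk;      # skip chunks that were all whitespace
        # ... any per-word transforms would go here ...
        push @words, split ' ', $chunk; # also splits on internal tabs and newlines
        while (@words >= $per_line) {
            print {$out} join(' ', splice(@words, 0, $per_line)), "\n";
        }
    }
    print {$out} join(' ', @words), "\n" if @words;   # whatever remains
}

# Small demo on an in-memory string, four words per line.
my $sample = "one two three\tfour five\nsix seven eight nine";
open my $demo_in, '<', \$sample or die "cannot open in-memory input: $!";
rewrap_by_words($demo_in, \*STDOUT, 4);
```

For the real file the call would be something like `rewrap_by_words(\*STDIN, \*STDOUT, 20)`. Substitutions that match across spaces, such as `s/\s-+\s/ /g` from the question, cannot see their context when applied to a single word, so they would need to run on the joined line just before it is printed.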
Without knowing more details about this problem, I'd suggest a brute force solution:
slurp the entire entry,
split to an array based on " ",
foreach the array and print "\n" after every 20 elements.
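A minimal sketch of this brute-force variant is shown below; the sub name `rewrap_slurped` is hypothetical. It holds the whole text and the word array in memory at once, so for a 7.2 GiB input it would need considerably more RAM than the file size.

```perl
use strict;
use warnings;

# Hypothetical helper: take the whole text as one string and return it
# with a newline after every $per_line words.
sub rewrap_slurped {
    my ($text, $per_line) = @_;
    my @words = split ' ', $text;       # split on any run of whitespace
    my $out   = '';
    my $i     = 0;
    foreach my $word (@words) {
        $i++;
        # Newline after every $per_line-th word, a space after the others.
        $out .= $word . ($i % $per_line == 0 ? "\n" : ' ');
    }
    $out =~ s/ \z/\n/;                  # end a short last line with a newline
    return $out;
}

# Small demo, three words per line.
print rewrap_slurped("alpha beta gamma delta epsilon zeta eta", 3);
```

Slurping standard input would look like `my $text = do { local $/; <STDIN> }; print rewrap_slurped($text, 20);`.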