Scraping a plain text file with no HTML?
I have the following data in a plain text file:
1. Value
Location : Value
Owner: Value
Architect: Value
2. Value
Location : Value
Owner: Value
Architect: Value
... up to 200+ ...
The numbering and the word Value change for each segment.
Now I need to insert this data into a MySQL database.
Do you have a suggestion on how I can traverse and scrape it so I can get the value of the text beside the number, and the values of "Location", "Owner", and "Architect"?
It seems hard to do with a DOM scraping class since there are no HTML tags present.
6 Answers
If the data is consistently structured, you can use fscanf() to scan it from the file. More about fscanf() in the docs.
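A minimal sketch of this scanf-style approach, reading line by line with fgets() and applying sscanf() to each line (the file name data.txt, the sample values, and the exact field spellings are assumptions based on the question):

```php
<?php
// Hypothetical sample matching the question's format.
file_put_contents('data.txt',
    "1. Building A\nLocation : City A\nOwner: Owner A\nArchitect: Architect A\n" .
    "2. Building B\nLocation : City B\nOwner: Owner B\nArchitect: Architect B\n");

$fp = fopen('data.txt', 'r');
$records = [];
while (($line = fgets($fp)) !== false) {
    $line = trim($line);
    if ($line === '') {
        continue;                         // skip blank separator lines, if any
    }
    if (sscanf($line, '%d. %[^\n]', $num, $title) === 2) {
        // a "1. Value" line starts a new record
        $records[] = ['number' => $num, 'name' => $title];
    } elseif (preg_match('/^(Location|Owner|Architect)\s*:\s*(.+)$/', $line, $m)) {
        // attach the field to the most recent record
        $records[count($records) - 1][strtolower($m[1])] = $m[2];
    }
}
fclose($fp);
print_r($records);
```

Each element of $records then maps directly onto one row for the MySQL INSERT.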
If every block has the same structure, you could do this with the file() function: http://nl.php.net/manual/en/function.file.php
With this, every row is an item in the array, and you can loop through it.
That will work with a very simple stateful line-oriented parser. On every line, you accumulate parsed data into an array(). When something tells you you're on a new record, you dump what you parsed and proceed again.
Line-oriented parsers have a great property: they require little memory and, most importantly, constant memory. They can process gigabytes of data without breaking a sweat. I'm managing a bunch of production servers, and there's nothing worse than scripts slurping whole files into memory (then stuffing arrays with parsed content, which requires more than twice the original file size in memory).
This works and is mostly unbreakable:
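The answer's code block did not survive extraction; below is a sketch of the stateful line-oriented parser it describes. parse_file() reads one line at a time (constant memory) and hands each completed record to the dump_record callback the answer mentions. The file name and sample data are assumptions:

```php
<?php
// Stateful line-oriented parser: accumulate fields into $record,
// flush via $dump_record whenever a new numbered line begins.
function parse_file(string $path, callable $dump_record): void {
    $record = [];
    $fp = fopen($path, 'r');
    while (($line = fgets($fp)) !== false) {
        $line = trim($line);
        if (preg_match('/^(\d+)\.\s*(.+)$/', $line, $m)) {
            if ($record !== []) {
                $dump_record($record);    // new record starts: flush the previous one
            }
            $record = ['number' => (int)$m[1], 'name' => $m[2]];
        } elseif (preg_match('/^([A-Za-z]+)\s*:\s*(.+)$/', $line, $m)) {
            $record[strtolower($m[1])] = $m[2];   // accumulate a field
        }
    }
    if ($record !== []) {
        $dump_record($record);            // flush the last record
    }
    fclose($fp);
}

// Hypothetical sample input; the callback just prints here -- in practice
// dump_record would emit an INSERT statement instead.
file_put_contents('data.txt',
    "1. Building A\nLocation : City A\nOwner: Owner A\nArchitect: Architect A\n" .
    "2. Building B\nLocation : City B\nOwner: Owner B\nArchitect: Architect B\n");
parse_file('data.txt', function (array $record) { print_r($record); });
```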
Obviously you'll need something suited to your taste in function dump_record, like printing a correctly formatted INSERT SQL statement.
This will give you what you want.
Agreed with Topener's solution; here's an example if each block is 4 lines + a blank line:
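This answer's code was also lost in extraction; a sketch of the fixed-size-block idea, assuming exactly 4 data lines plus a blank line per record (file name and sample data are hypothetical):

```php
<?php
// Hypothetical sample: 4 data lines + 1 blank line per block.
file_put_contents('data.txt',
    "1. Building A\nLocation : City A\nOwner: Owner A\nArchitect: Architect A\n\n" .
    "2. Building B\nLocation : City B\nOwner: Owner B\nArchitect: Architect B\n\n");

$lines = file('data.txt', FILE_IGNORE_NEW_LINES);
$records = [];
foreach (array_chunk($lines, 5) as $chunk) {      // 4 data lines + 1 blank line
    $records[] = [
        'name'      => $chunk[0],  // still contains the "1. " prefix -- the answer
                                   // notes it does not clean up the first segment
        'location'  => trim(explode(':', $chunk[1], 2)[1]),
        'owner'     => trim(explode(':', $chunk[2], 2)[1]),
        'architect' => trim(explode(':', $chunk[3], 2)[1]),
    ];
}
print_r($records);
```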
The structure will be convenient to use, for MySQL or whatever else. I didn't add code to remove the colon from the first segment.
Good luck!