Scraping a plain text file with no HTML?

Posted on 2024-12-20 03:22:52

I have the following data in a plain text file:

1.  Value
Location :  Value
Owner:  Value
Architect:  Value

2.  Value
Location :  Value
Owner:  Value
Architect:  Value

... up to 200+ ...

The number and the text shown as Value change for each segment.

Now I need to insert this data into a MySQL database.

Do you have a suggestion on how I can traverse and scrape it, so that I get the text beside the number as well as the values of "Location", "Owner", and "Architect"?

This seems hard to do with a DOM scraping class, since there are no HTML tags present.

Comments (6)

仅一夜美梦 2024-12-27 03:22:52

If the data is consistently structured, you can use fscanf to scan it from the file.

/* Notice the newlines at the end! */
$format = <<<FORMAT
%d. %s
Location :  %s
Owner:  %s
Architect:  %s


FORMAT;

$file = fopen('file.txt', 'r');
while ($data = fscanf($file, $format)) {
    list($number, $title, $location, $owner, $architect) = $data;
    // Insert the data to database here
}
fclose($file);

More about fscanf in docs.
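
As far as I know, PHP's fscanf() consumes one line of the file per call, and %s stops at the first whitespace, so multi-word values may need a different conversion. Here is a minimal per-line sketch under those assumptions, using sscanf() with the %[^\n] set. The helper scan_records() and the field names are hypothetical, not part of the answer above:

```php
<?php
// Hypothetical helper: parses records from any iterable of lines,
// e.g. the array returned by file() or a generator wrapped around fgets().
function scan_records(iterable $lines): array
{
    // One format per labelled line; %[^\n] reads to the end of the line,
    // spaces included. Whitespace in a format matches zero or more spaces.
    $labels = [
        'location'  => "Location : %[^\n]",
        'owner'     => "Owner : %[^\n]",
        'architect' => "Architect : %[^\n]",
    ];
    $records = [];
    $current = null;
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // blank separator between records
        }
        // Header line such as "1.  Value"
        $head = sscanf($line, "%d. %[^\n]");
        if (is_array($head) && $head[0] !== null && $head[1] !== null) {
            $records[] = ['number' => $head[0], 'title' => $head[1]];
            $current = count($records) - 1;
            continue;
        }
        if ($current === null) {
            continue; // labelled line before any header: ignore it
        }
        // Try each labelled format until one captures a value
        foreach ($labels as $field => $format) {
            $match = sscanf($line, $format);
            if (is_array($match) && $match[0] !== null) {
                $records[$current][$field] = $match[0];
                break;
            }
        }
    }
    return $records;
}
```

Calling scan_records(file('file.txt')) would then return one array per record, ready for the database insert.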

娇柔作态 2024-12-27 03:22:52

If every block has the same structure, you could do this with the file() function: http://nl.php.net/manual/en/function.file.php

$data = file('path/to/file.txt');

With this, every line of the file is an item in the array, and you can loop through it.

for ($i = 0; $i<count($data); $i+=5){
    $valuerow = $data[$i];
    $locationrow = $data[$i+1];
    $ownerrow = $data[$i+2];
    $architectrow = $data[$i+3];
    // strip the data you don't want here, and insert it into the database.
}
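
The "strip the data you don't want" step could look roughly like the sketch below. It assumes every block is exactly four data lines followed by one blank line; rows_from_lines() and the array keys are hypothetical names chosen for illustration:

```php
<?php
// Hypothetical helper: turns the array returned by file() into one row per block.
// Assumes each block is 4 data lines followed by 1 blank line (a stride of 5).
function rows_from_lines(array $data): array
{
    // Returns the text after the first colon, without surrounding whitespace
    $after_colon = function (string $row): string {
        $parts = explode(':', $row, 2);
        return trim($parts[1] ?? '');
    };

    $rows = [];
    for ($i = 0; $i + 3 < count($data); $i += 5) {
        $rows[] = [
            // Drop the leading "1." from the first line of the block
            'title'     => trim(preg_replace('/^\s*\d+\.\s*/', '', $data[$i])),
            'location'  => $after_colon($data[$i + 1]),
            'owner'     => $after_colon($data[$i + 2]),
            'architect' => $after_colon($data[$i + 3]),
        ];
    }
    return $rows;
}
```

With this, rows_from_lines(file('path/to/file.txt')) gives one associative array per block. If the file ever contains a block with a missing line, the fixed stride of 5 drifts out of step, so a line-by-line parser is safer for irregular input.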

荒岛晴空 2024-12-27 03:22:52

That will work with a very simple stateful, line-oriented parser. For every line, you accumulate the parsed data into an array(). When something tells you that you're on a new record, you dump what you have parsed and start again.

Line-oriented parsers have a great property: they require little memory and, most importantly, constant memory. They can process gigabytes of data without breaking a sweat. I manage a bunch of production servers, and there's nothing worse than scripts that slurp whole files into memory (and then stuff arrays with the parsed content, which requires more than twice the original file size in memory).

This works and is mostly unbreakable:

<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die("cannot open $in_name");

function dump_record($r) {
    print_r($r);
}

$current = array();
/* Compare against false so a final line containing just "0" is not dropped */
while (($line = fgets($in)) !== false) {
    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for '123. <value> ' stanzas */
    if (preg_match('/^(\d+)\.\s+(.*?)\s*$/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record, keeping the text beside the number */
        $current = array( 'id' => $start[1], 'title' => $start[2] );
    }
    else if (preg_match('/^\s*(.*?)\s*:\s+(.*?)\s*$/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza (key is trimmed) */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}

/* Don't forget to dump the last parsed record, a situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);
?>

Obviously you'll need something suited to your taste in the function dump_record, like printing a correctly formatted INSERT SQL statement.
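
One possible shape for dump_record, sketched with a PDO prepared statement rather than a printed SQL string. The table name buildings, its columns, and the 'title' key are assumptions for illustration only; 'title' stays null if the parser does not store the text beside the number:

```php
<?php
// Sketch only: the `buildings` table and its columns are assumed, not given
// in the question. Adjust the names to the real schema.

// Maps one parsed record to named parameters for the INSERT below.
function record_to_params(array $r): array
{
    // Normalise keys such as "Location " or "Owner" to trimmed lower case
    $norm = [];
    foreach ($r as $key => $value) {
        $norm[strtolower(trim((string) $key))] = trim((string) $value);
    }
    return [
        ':id'        => (int) ($norm['id'] ?? 0),
        ':title'     => $norm['title'] ?? null,
        ':location'  => $norm['location'] ?? null,
        ':owner'     => $norm['owner'] ?? null,
        ':architect' => $norm['architect'] ?? null,
    ];
}

// Executes a prepared INSERT for one record; bound parameters keep quotes
// in the data from breaking the statement.
function dump_record(PDOStatement $insert, array $r): void
{
    $insert->execute(record_to_params($r));
}

// Usage (connection details are placeholders):
// $db = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
// $insert = $db->prepare(
//     'INSERT INTO buildings (id, title, location, owner, architect)
//      VALUES (:id, :title, :location, :owner, :architect)'
// );
// dump_record($insert, $current);
```

Preparing the statement once and executing it per record keeps the memory profile of the line-oriented parser intact, since only one record is held at a time.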

请止步禁区 2024-12-27 03:22:52

This will give you what you want:

// Assumes $txt holds the file contents and $mysqli is an open mysqli connection
// (mysql_query() was removed in PHP 7).
$stmt = $mysqli->prepare("INSERT INTO `table` (id, location, owner, architect) VALUES (?, ?, ?, ?)");

// Split into blocks on blank lines, whatever the line-ending style
$array = preg_split('/\R{2,}/', trim($txt));
foreach ($array as $value) {
    // Text beside the number; \s* tolerates the varying number of spaces
    if (!preg_match('#^\d+\.\s*(.*?)\s*$#m', $value, $id)) continue;
    preg_match('#^Location\s*:\s*(.*?)\s*$#m', $value, $location);
    preg_match('#^Owner\s*:\s*(.*?)\s*$#m', $value, $owner);
    preg_match('#^Architect\s*:\s*(.*?)\s*$#m', $value, $architect);

    $id = $id[1];
    $location = $location[1] ?? null;
    $owner = $owner[1] ?? null;
    $architect = $architect[1] ?? null;

    // Bound parameters avoid SQL injection from quotes in the data
    $stmt->bind_param('ssss', $id, $location, $owner, $architect);
    $stmt->execute();
    // Change the table and column names to match your schema
}

不顾 2024-12-27 03:22:52

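I agree with Topener's solution. Here's an example, assuming each block is 4 lines plus a blank line:

$data = file('path/to/file.txt', FILE_IGNORE_NEW_LINES);
$id = 0;
$parsedData = array();
foreach ($data as $n => $row) {
  if (trim($row) === '') continue;  // blank separator line
  if (($n % 5) == 0) {
    // "1.  Value" -> id 1, text "Value"
    list($num, $text) = explode('.', $row, 2);
    $id = (int) $num;
    $parsedData[$id]['title'] = trim($text);
  } else {
    // "Location :  Value" -> key "Location", value "Value"
    list($key, $val) = explode(':', $row, 2);
    $parsedData[$id][trim($key)] = trim($val);
  }
}

The resulting structure is convenient to use, for MySQL or anything else. The explode() on the colon removes it from the first segment, and trim() removes the surrounding spaces.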

Good luck!

太阳男子 2024-12-27 03:22:52
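// Assumes $txt holds the file contents. Group 2 is the text beside the number.
preg_match_all('/^(\d+)\.\s*(.*?)\s*^Location\s*:\s*(.*?)\s*^Owner\s*:\s*(.*?)\s*^Architect\s*:\s*(.*?)\s*$/mi', $txt, $m);

$matched = array();

// $k is the index of the match, $v is the record number captured by group 1
foreach ($m[1] as $k => $v) {

    $matched[$v] = array(
        "title" => trim($m[2][$k]),
        "location" => trim($m[3][$k]),
        "owner" => trim($m[4][$k]),
        "architect" => trim($m[5][$k])
    );

}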