Problem parsing data with PHP and storing it in a MySQL database

Posted 2024-08-06 21:44:47

Sorry for duplicating this question, but here I have tried to explain it in more detail.
I need to parse the data from a certain file and store it in a database (MySQL). This is how the data appears in the file:

戚谊 
戚誼 
    [m1][b]qīyì[/b][/m] 
    [m2]translation 1[/m] 
    [m1][b]qīyi[b][/m] 
    [m2]translation 2[/m] 
三州府 
    [m1][b]sānzhōufǔ[/b][/m] 
    [m2]translation of other character[/m]
etc.

The first and second lines represent the same character: the first line is the simplified form and the second line is the traditional form. I need to store them in the ch_simplified and ch_trad columns respectively.

The third line, which begins with [m1], is a transcription (pinyin); the fourth line (beginning with [m2]) is a translation of the character. There is also a second translation of the character; note that it has a different transcription.

We need to store all transcriptions (sometimes there are more than two for the same character) in a separate column (transcription), and then store the whole translation part in the translation column.

The table in the MySQL database looks like this:

ID  |  ch_simplified  |  ch_trad    | transcription           |   translation               | 
--------------------------------------------------------------------------------------------- 
1.        戚谊             戚誼        [m1][b]qīyì[/b][/m];     [m1][b]qīyì[/b][/m] 
                                      [m1][b]qīyi[b][/m]       [m2]translation 1[/m] 
                                                               [m1][b]qīyi[b][/m] 
                                                               [m2]translation 2[/m] 
---------------------------------------------------------------------------------------------
2.        三州府           三州府      [m1][b]sānzhōufǔ[/b][/m]  [m1][b]sānzhōufǔ[/b][/m] 
                                                               [m2]translation of other character[/m]

The problem is that I don't know how to parse this data using PHP. I tried to start with

$content = file_get_contents('myfile.txt', true);

and got stuck at the step where I have to separate the data between the first character and the second character (戚谊 and 三州府).

Any help would be greatly appreciated!

P.S. Sorry for such a long text and confusing explanation.

萌化 2024-08-13 21:44:47

Your data fields are on separate lines, so Phil's explode() call would split on the newline character. The basic field extraction is something like this:

$content = file_get_contents('myfile.txt', true);
if ($content === false) {
  die("could not read myfile.txt\n");
}

foreach (explode("\n", $content) as $line)
{
  $line = trim($line);  // remove leading and trailing white space
  if ($line === '') {
    continue;  // skip empty lines
  }
  switch (substr($line, 0, 4)) // examine the first four bytes
  {
    case '[m1]':
      // the square brackets and the slash are escaped in the regular expression
      if (preg_match('/^\[m1\](.+)\[\/m\]$/', $line, $matches)) {
        $field = $matches[1];
        echo "pinyin: '$field'\n";
      }
      break;

    case '[m2]':
      if (preg_match('/^\[m2\](.+)\[\/m\]$/', $line, $matches)) {
        $field = $matches[1];
        echo "translation: '$field'\n";
      }
      break;

    default:
      $field = $line;  // for clarity
      echo "character: '$field'\n";
      break;
  }
}

Here, I have not attempted to identify (a) the start of a new record, or (b) which character line is simplified and which is traditional. Both can probably be handled by counting the character lines: the first one is simplified, the second is traditional, and the first character line after a run of [m1]/[m2] lines marks the start of a new record. But that's your job.
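The counting idea above can be sketched roughly as follows. This is only an illustration under some assumptions: groupRecords() is a hypothetical helper name, character lines are assumed to carry no [m1]/[m2] prefix, and an entry with a single character line (like 三州府) is assumed to use the same text for both columns, as the table in the question suggests.

```php
<?php
// Hypothetical helper: groups the raw lines into one record per entry.
// Assumption: a character line that follows a tagged line starts a new record.
function groupRecords(string $content): array
{
    $records = [];
    $current = null;
    $inBody = false; // true once the current record has seen an [m1]/[m2] line

    foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $tag = substr($line, 0, 4);
        if ($tag === '[m1]' || $tag === '[m2]') {
            if ($current === null) {
                continue; // tagged line before any character line: ignore it
            }
            $inBody = true;
            if ($tag === '[m1]') {
                $current['transcription'][] = $line;
            }
            $current['translation'][] = $line; // the table keeps both kinds here
            continue;
        }
        if ($current === null || $inBody) {
            // First character line of a new record
            if ($current !== null) {
                $records[] = $current;
            }
            $current = [
                'ch_simplified' => $line,
                'ch_trad'       => $line, // overwritten if a second line follows
                'transcription' => [],
                'translation'   => [],
            ];
            $inBody = false;
        } else {
            $current['ch_trad'] = $line; // second character line
        }
    }
    if ($current !== null) {
        $records[] = $current;
    }
    return $records;
}

$sample = <<<'TXT'
戚谊
戚誼
    [m1][b]qīyì[/b][/m]
    [m2]translation 1[/m]
    [m1][b]qīyi[b][/m]
    [m2]translation 2[/m]
三州府
    [m1][b]sānzhōufǔ[/b][/m]
    [m2]translation of other character[/m]
TXT;

$records = groupRecords($sample);
echo count($records), " records\n";
```

Each record keeps its transcription and translation lines as arrays, so they can be joined (for example with implode(";\n", $record['transcription'])) just before being written to the corresponding columns.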

Nor have I assessed any issues relating to the non-ASCII character set. I assume you are on top of that stuff.
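For what it's worth, here is a minimal sketch of the kind of check that might be done before parsing. It assumes the file is meant to be UTF-8 and that the mbstring extension is available; normalizeUtf8() is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: prepares raw file contents for line-by-line parsing.
// Assumption: the input is supposed to be UTF-8, possibly with a leading BOM.
function normalizeUtf8(string $raw): string
{
    // Drop a UTF-8 byte order mark, which would otherwise stick to the first line.
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        $raw = substr($raw, 3);
    }
    if (!mb_check_encoding($raw, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // Normalize Windows and old Mac line endings so explode("\n", ...) works.
    return str_replace(["\r\n", "\r"], "\n", $raw);
}

// Byte-oriented functions count bytes, not characters:
echo strlen('戚谊'), "\n";             // 6 (three bytes per character in UTF-8)
echo mb_strlen('戚谊', 'UTF-8'), "\n"; // 2
```

On the database side, the connection character set matters as well. With PDO, for instance, the MySQL DSN accepts a charset option such as charset=utf8mb4 (host and database names would be your own).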

I have taken the opportunity to separate the content from the wrapping markup (the [m1]/[m2] and [/m] tags). Note that inner tags such as [b] still remain in the captured field. It's just good practice to keep presentational markup separate from the data proper.
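If the stored fields should be free of all tags, including [b], a small regex pass could remove them. This is a sketch only: removeBracketTags() is a hypothetical name, and it assumes that anything of the form [name] or [/name] is markup rather than data.

```php
<?php
// Hypothetical helper: removes square-bracket tags such as [m1], [b], [/b], [/m].
// Assumption: a bracketed run of a letter followed by letters/digits is always markup.
function removeBracketTags(string $field): string
{
    $clean = preg_replace('/\[\/?[a-z][a-z0-9]*\]/i', '', $field);
    return trim($clean ?? $field); // fall back to the input if the regex fails
}

echo removeBracketTags('[m1][b]qīyì[/b][/m]'), "\n";   // qīyì
echo removeBracketTags('[m1][b]qīyi[b][/m]'), "\n";    // qīyi (the unclosed [b] is removed too)
echo removeBracketTags('[m2]translation 1[/m]'), "\n"; // translation 1
```

Because it removes tags individually instead of matching open/close pairs, it also copes with the sample line where [b] appears twice without a closing [/b].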
