Parsing text with Python: unstructured but similar information in varying formats

Posted on 2024-10-31 12:34:29

I'm trying to parse thousands of spec sheet text files containing company, material, chemical properties, etc. (Material Safety Data Sheets, to be specific) with Python. The text files contain similar information in loosely structured formatting such that it's human readable, but unstructured enough that it's not easily parsed (e.g. not XML or CSV). In short, it's just all over the place.

Originally the data was entered by hand by different people working at different companies. Another set of people then transcribed the information into these text files (by OCR-ing it into a txt file).

Is there a parsing library or patterns to extract bits of information of this type? (This seems to be a "common" data entry problem.) Certainly regular expressions will be used a lot. I don't have any experience with natural language processing libraries. Would they even be appropriate for the problem?

My initial thought is to try to group the files into different categories, then create a set of parsing functions for each format. Unfortunately, this may only work for a small subset of the problem, and the different cases could quickly spiral out of control.
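
A minimal sketch of that grouping idea, assuming the layouts can be told apart by their section-heading style. The layout names, detector patterns, and parser functions below are hypothetical placeholders for illustration, not part of any library.

```python
import re

# Hypothetical detectors: each maps a layout name to a pattern that
# should only match files written in that layout.
DETECTORS = {
    "section_word": re.compile(r"^\W*SECTION\s+[IVX\d]+", re.IGNORECASE | re.MULTILINE),
    "numbered": re.compile(r"^\d+\.\s+\S", re.MULTILINE),
}


def classify(text):
    """Return the name of the first layout whose detector matches, else 'unknown'."""
    for name, pattern in DETECTORS.items():
        if pattern.search(text):
            return name
    return "unknown"


def parse_section_word(text):
    # Placeholder: a real parser would extract the fields for this layout.
    return {"layout": "section_word", "length": len(text)}


def parse_numbered(text):
    # Placeholder: a real parser would extract the fields for this layout.
    return {"layout": "numbered", "length": len(text)}


PARSERS = {"section_word": parse_section_word, "numbered": parse_numbered}


def parse(text):
    """Dispatch to the parser for the detected layout; None means 'needs review'."""
    parser = PARSERS.get(classify(text))
    return parser(text) if parser else None
```

Returning `None` for unrecognized files keeps the unknown cases in a separate pile, which makes it easier to see how fast the number of categories is actually growing.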

Since this question is general, I'll provide a bunch of examples illustrating the problem.

ADDRESS INFORMATION
Each file contains company information such as name and address. The information may or may not have an identifier, it may or may not be on one line, etc. In short, there seems to be every combination.

Ex. (w/ field info):

MANUFACTURER: Foo Bar Inc.  
ADDRESS: 123 Foo St.  
Bar, CA 90012

Ex. (w/o field info):

Foo Bar Inc.  
123 Foo St.  
Bar, CA 90012

Ex. (Sometimes extra lines between information):

FOO BAR INC.

123 FOO ST.

BAR, CA 90012

Ex. (inconsistent field names):

MANUFACTURER'S NAME: FOO BAR INC.  
CREATIVE DIVISION  
ADDRESS: 123 FOO ST.  
CITY, STATE & ZIP: BAR, CALIFORNIA 90012  
PHONE NUMBER: 310-111-2222
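
One pattern that shows up in all four variants is the "city, state ZIP" line, which could serve as an anchor for locating the rest of the address. The sketch below assumes US-style addresses with a 5-digit ZIP (optionally ZIP+4); the function name `find_city_state_zip` is made up for illustration.

```python
import re

# City, then a comma, then a state (abbreviation or full name), then a ZIP code.
CITY_STATE_ZIP = re.compile(
    r"(?P<city>[A-Za-z][A-Za-z .'\-]*?),\s*"
    r"(?P<state>[A-Za-z][A-Za-z ]*?)\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def find_city_state_zip(line):
    """Return a dict with city, state, and zip, or None if the line does not fit."""
    # Drop an optional "LABEL:" prefix so labeled and unlabeled lines look alike.
    value = line.rsplit(":", 1)[-1].strip()
    match = CITY_STATE_ZIP.match(value)
    return match.groupdict() if match else None


print(find_city_state_zip("Bar, CA 90012"))
print(find_city_state_zip("CITY, STATE & ZIP: BAR, CALIFORNIA 90012  "))
print(find_city_state_zip("ADDRESS: 123 Foo St."))
```

Once such a line is found, the non-blank lines just above it are candidates for the street address and company name, whether or not they carry a label.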

SECTION INFO
The spec sheets also have similar sections, but with inconsistent ordering, headings, numeral types, and delimiters.

Ex:

========================================
SECTION 1 -- MATERIALS
========================================

Ex:

Section I. Materials
------------------------------------------

Ex:

----- Section 3       Materials
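
A single regular expression can cover the three heading styles above, provided the word "Section" is present. This is a sketch under that assumption; headings without the keyword would need a separate, stricter rule, and the helper names are illustrative only.

```python
import re

HEADING = re.compile(
    r"""^\W*                     # leading decoration such as dashes
        section\s+               # the keyword, in any letter case
        (?P<num>\d+|[IVXLC]+)\b  # Arabic or Roman numeral
        [\s.:\-]*                # separators: dot, dashes, colon, spaces
        (?P<title>\S.*?)         # title text
        \s*$                     # trailing whitespace
    """,
    re.IGNORECASE | re.VERBOSE,
)

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(text):
    """Convert a Roman numeral such as 'VIII' or 'iv' to an integer."""
    values = [ROMAN[ch] for ch in text.upper()]
    total = 0
    for i, value in enumerate(values):
        # Subtract when a smaller numeral precedes a larger one (e.g. IV).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def parse_heading(line):
    """Return (section_number, title) for a heading line, else None."""
    match = HEADING.match(line)
    if not match:
        return None
    num = match.group("num")
    number = int(num) if num.isdigit() else roman_to_int(num)
    return number, match.group("title")


for sample in ["SECTION 1 -- MATERIALS", "Section I. Materials",
               "----- Section 3       Materials"]:
    print(parse_heading(sample))
```

Normalizing the numeral to an integer lets files that number their sections differently be compared by section position as well as by title.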

Sometimes the files had their width changed, which introduced line breaks like the following.

Ex:

===================================================
1.    Materials
===================================================

Becomes:

=========================================
==========
1.    Materials
=========================================
==========
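
The wrapped rulers can be repaired before any other parsing. The sketch below assumes a ruler is a line made of at least three copies of one punctuation character and that two adjacent rulers of the same character are always one ruler that got wrapped; a file that stacks two rulers on purpose would be merged too.

```python
import re

# A ruler is a line made of one repeated punctuation character, e.g. "=====".
RULER = re.compile(r"^([=\-_*#~])\1{2,}\s*$")


def merge_wrapped_rulers(lines):
    """Join each ruler line with a directly following ruler of the same character."""
    merged = []
    previous_was_ruler = False
    for line in lines:
        stripped = line.rstrip()
        match = RULER.match(stripped)
        if match and previous_was_ruler and merged[-1][0] == match.group(1):
            # Continuation of a ruler that was split by re-wrapping.
            merged[-1] += stripped
        else:
            merged.append(stripped)
        previous_was_ruler = bool(match)
    return merged


wrapped = ["=" * 41, "=" * 10, "1.    Materials", "=" * 41, "=" * 10]
print(merge_wrapped_rulers(wrapped))
```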

Here is a complete example:
Hopefully this will clarify the issues with parsing the files. You'll notice the line wrapping, information split across different lines, etc. Not all files have this exact structure; some are formatted differently, with information in different places. Here is a link to a paper hard copy.

MATERIAL SAFETY DATA SHEET

=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========

MANUFACTURER:         Some Company Inc     EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS:              Some Road
City, ST
12346

IDENTITY (AS USED ON
LABEL AND LIST):      Some Identity

PREPARATION DATE:     Some Date

=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========

OSHA
ACGIH
HAZARDOUS COMPONENTS             CAS#       PEL   TWA        TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------

Some Chemical             111-22-3   15    10         10
12.34


=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========

Boiling Point:              N/A  Specific Gravity (H20=1):   N/A
Vapor Pressure (mm Hg):     N/A  Melting Point:              N/A
Vapor Density (AIR=1)       N/A  Evaporation Rate
(Butyl Acetate=1)           N/A
Solubility in Water:        None

Appearance:  Solid, various colors, may have slight
odor.

N/A = Not applicable

=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========

FLASH POINT (METHOD USED):  None
FLAMMABLE LIMITS:  None          LEL:  N/A        UEL:  N/A
EXTINGUISHING MEDIA:  None
SPECIAL FIRE FIGHTING PROCEDURES:  None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS:  None.

=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========

STABILITY:  Stable
CONDITIONS TO AVOID:  None
INCOMPATIBILITY (MATERIALS TO AVOID):  None
HAZARDOUS POLYMERIZATION:  Will not occur

=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========

ROUTES OF ENTRY:

INHALATION:  Yes
SKIN:  Possibly
INGESTION:  Possibly
EYES:  Possibly

HEALTH HAZARDS (ACUTE AND CHRONIC):  Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.

CARCINOGENICITY:  No applicable information found.

SIGNS AND SYMPTOMS OF EXPOSURE:  Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.

MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE:  Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.

EMERGENCY AND FIRST AID PROCEDURES:  Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.

=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========

STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.

WASTE DISPOSAL METHOD:  Standard landfill methods consistent with
applicable state and federal regulations.

PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING:  Use caution not
to drop,
crush, break or chip.

OTHER PRECAUTIONS:  Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.

=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========

RESPIRATORY PROTECTION (SPECIFY TYPE):  OSHA or NIOSH approved
respirators
may be required.

VENTILATION:  Local exhaust recommended.  Special:  N/A.
Mechanical:  Useful.  Other:  N/A.

PROTECTIVE GLOVES:  May be useful.

EYE PROTECTION:  Recommended.

OTHER PROTECTIVE CLOTHING OR EQUIPMENT:  Not required.

WORK/HYGIENIC PRACTICES:  Keep clothing and area clean.  Wash to
remove


绻影浮沉 2024-11-07 12:34:29

I'd write a for loop with lots of state variables, processing each line and using the state variables to keep track of what is going on. The conditionals (if) inside the for loop would ask the same "questions" a human would have to ask when parsing the file by hand.

"
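for line in file:
    if ":" in line:
        # Labeled line: the text before the colon names the field
        name, _, data = line.partition(":")
        field_name = normalize(name)
    else:
        # Unlabeled line: assume it is the next field in the expected order
        field_name = next_field_in_list(previous_field)
        data = line
    previous_field = field_name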
"

And so on.
I could not tell from the examples whether you at least have a fixed order for the fields, and either a maximum number of fields per record or a distinct record separator. Without these, I think it would be harder to write.
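
To make the loop above concrete, here is a self-contained sketch. It assumes a fixed field order and a small alias table, both invented for this example (`ALIASES`, `FIELD_ORDER`, `next_field`, and `parse_record` are not from any library), so treat it as an illustration of the state-variable approach and not a finished parser.

```python
import re

# Hypothetical alias table mapping label spellings to canonical field names.
ALIASES = {
    "manufacturer": "manufacturer",
    "manufacturers name": "manufacturer",
    "address": "address",
    "city state zip": "city_state_zip",
    "phone number": "phone",
}
# Assumed field order, used when a line carries no recognizable label.
FIELD_ORDER = ["manufacturer", "address", "city_state_zip", "phone"]


def normalize(label):
    """Lower-case a label, drop punctuation, and look it up in ALIASES."""
    key = re.sub(r"[^a-z ]", "", label.lower())
    return ALIASES.get(" ".join(key.split()))


def next_field(previous):
    """Return the field after `previous` in FIELD_ORDER, or None at the end."""
    if previous is None:
        return FIELD_ORDER[0]
    index = FIELD_ORDER.index(previous) + 1
    return FIELD_ORDER[index] if index < len(FIELD_ORDER) else None


def parse_record(lines):
    record = {}
    previous = None  # state: the last field that was filled
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        label, sep, rest = line.partition(":")
        field = normalize(label) if sep else None
        if field:
            data = rest.strip()
        else:
            # No recognizable label: assume the next field in order.
            field, data = next_field(previous), line
        if field is None:
            continue  # ran out of expected fields
        record[field] = data
        previous = field
    return record


labeled = ["MANUFACTURER: Foo Bar Inc.  ", "ADDRESS: 123 Foo St.  ", "Bar, CA 90012"]
unlabeled = ["Foo Bar Inc.", "", "123 Foo St.", "", "Bar, CA 90012"]
print(parse_record(labeled) == parse_record(unlabeled))
```

The labeled and unlabeled address examples from the question produce the same record here. The "inconsistent field names" example shows the weakness: the unlabeled `CREATIVE DIVISION` line is first guessed to be the address and then overwritten by the real `ADDRESS:` line, which is exactly the kind of case where a fixed field order stops being enough.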
