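Parsing incomplete text in Boost::Spirit::Qi
I want to parse text that I read using ifstream::read. The problem I am facing is that the parser always reports an expectation failure when it reads incomplete text. Here is my parser code.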
template <typename It, typename Skipper= qi::space_type>
struct xmlparser: qi::grammar<It, std::string(), Skipper>{
    xmlparser(): xmlparser::base_type(xml_parser){
        using qi::lit;
        using qi::lexeme;
        using ascii::char_;
        using boost::phoenix::ref;
        using qi::debug;
        using boost::spirit::ascii::space;
        skipper= qi::char_("\t\r\n "); //qi::skip(skipper.alias())
        text = !lit('<') >> +(qi::char_ - qi::char_("<")) | lit('\'') | lit('\"');
        prolog = "<?" >> +(qi::char_ - '?') >> "?>";
        name = lexeme[qi::char_("a-zA-Z:_") >> *qi::char_("-a-zA-Z0-9:_")];
        attribute_value =
            '"' > +(char_ - char_("<&\"")) > '"'
            | '\'' > +(char_ - char_("<&'")) > '\''
            ;
        attribute = name[print_action("ATT")] > '=' > attribute_value[print_action("ATT VALUE")];
        start_tag %= '<' >> !lit('/') >> name >> *(attribute) >> !lit('/')>> '>';
        end_tag = "</" >> name >> '>';
        empty_tag = '<' >> name >> *(attribute) >> "/>";
        xml_parser =
            *(text/*[print_action("TEXT")]*/
              |start_tag[/*++ref(open_tag_count)*/print_action("OPEN")]
              | end_tag[/*++ref(end_tag_count)*/print_action("END")]
              | empty_tag[/*++ref(empty_tag_count)*/print_action("EMPTY")]
              | prolog
              | skipper
            );
    }
    int get_empty_tag_count(){
        return empty_tag_count;
    }
    int get_open_tag_count(){
        return open_tag_count;
    }
    int get_end_tag_count(){
        return end_tag_count;
    }
private:
    int open_tag_count= 0;
    int end_tag_count= 0;
    int empty_tag_count= 0;
    int text_count=0;
    qi::rule<It> skipper;
    qi::rule<It, std::string()> text;
    qi::rule<It, std::string()> prolog;
    qi::rule<It, std::string(),Skipper> name;
    qi::rule<It, std::string()> attribute_value;
    qi::rule<It, std::string(),Skipper> attribute;
    qi::rule<It, std::string(),Skipper> start_tag;
    qi::rule<It, std::string(),Skipper> end_tag;
    qi::rule<It, std::string(),Skipper> empty_tag;
    qi::rule<It, std::string(),Skipper> xml_parser;
};
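When I read using ifstream::getline I have no problems, because the text fed to the parser can be treated as complete. However, when I read the text using ifstream::read and the char[bufsize] buffer happens to end in the middle of, say, an XML attribute, an expectation failure is reported.
Example of incomplete text: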
</description>
<shipping>Will ship only within country, See description for charges</shipping>
<incategory category="category317"/>
<incategory category="categ
The function that reads the characters:
char * buffer= new char[bufsize];
input_file.read(buffer,bufsize);
std::string bufstring(buffer);
if (extra != ""){
bufstring = extra + bufstring;
extra= "";
}
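I would like to know whether it is possible to get back the part that failed to parse and prepend it to the next read, since the next read contains the continuation of the incomplete text. I tried writing a try/catch so that the text that failed to parse is carried over to the next read, but it does not seem to work.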
if (extra != ""){
    bufstring = extra + bufstring;
    extra= "";
}
// std::cout << bufstring << std::endl << std::endl;
std::string::const_iterator iter= bufstring.begin();
std::string::const_iterator end= bufstring.end();
try{
    bool r= qi::phrase_parse(iter,end,xml_parser,qi::space);
    if (!r){
        std::cout << "Error found" << std::endl;
        extra = std::string(iter,end);
        std::cout << extra << std::endl;
        delete[] buffer;
        return;
    }
    if (iter!=end){
        extra = std::string(iter,end);
        // std::cout << extra << std::endl;
    }
} catch (expectation_failure<char const*> const& e){
    std::cout<< std::string(iter,end) << std::endl;
    extra = std::string(iter,end);
}
Comments (1)
Don't roll your own XML parser. XML libraries have stream parsers for this purpose. They work a bunch. I could go look for my xpathreader implementation if you're interested.
That said, maybe you're just trying to learn Spirit Qi. Let's dig in.
Lots of small observations:
skipper seems like a Very Bad name since you already have Skipper (the Qi feature).
qi::char_("\t\r\n ") is very close to qi::space, if not identical.
Asserting !lit('/') is useless when you expect name anyways (which cannot legally start with '/').
char_ - char_(set) is just ~char_(set).
You do seem to have too many expectation points. Some of them make little sense. See e.g. boost::spirit::qi Expectation Parser and parser grouping unexpected behaviour
Attribute values can be empty
Names are lexeme, so why declare a skipper? (see Boost spirit skipper issues)
You didn't define print_action. Perhaps it is a simple printing functor, or a slightly more functional one. The latter mainly works because all rules ... expose a string attribute.
About that. The text rule has the net effect of accepting all text until an XML tag, or a single quote character. Even if you actually meant the quote alternatives to be part of the repeated group, it would be accepting all text until an XML tag but swallowing '\'' and '"' for... reasons?
All rules expose a string attribute?! What are you parsing? What are you parsing for?
The example suggests that you're merely tokenizing. If so, you probably want to expose the input sequences (qi::raw) and you wouldn't want to drop lexical items like the interpunction as you're doing now.
Parsing from a forward-traversable container instead of an input iterator would allow you to avoid copying the source sequences at all (using either a std::string_view or boost::iterator_range instead).
Instead of a bespoke print_action consider using the builtin grammar debugging (e.g. BOOST_SPIRIT_DEBUG together with BOOST_SPIRIT_DEBUG_NODES).
There's a conceptual thing on ordering rules in PEG grammars. PEG being greedy and left-to-right means that you can reorder the alternatives in xml_parser so that the tag rules are tried before text. (The skipper alternative doesn't make sense there unless you were to make it XML comment-aware.)
Now you can drop the !lit('<') from the start of text as well.
There's a problem with semantic actions (Boost Spirit: "Semantic actions are evil"?) and side-effects on container attributes (Boost::Spirit doubles character when followed by a default value). The problem mainly being semantic actions. Also, I suggest that you would normally parse an "empty" tag (which isn't empty in many senses) as an open/close combo. This prevents all attributes being parsed doubly.
Updating parser members from semantic actions is also a surprise, violating the const contract of .parse(...) const, because the semantic actions embed mutable references to member data.
There's a lot of correctness issues with respect to valid XML. Your grammar doesn't even try to validate matching pairs of open/close tags.
prolog is actually a processing instruction. I haven't checked that ? cannot occur on its own before ?>, have you?
I don't think valid XML elements can start with ':'.
There's no provision for namespaces, entity references, CDATA, PCDATA. And we'll not even open the can of worms that is XSD resolution/validation. This bullet will just be forgotten under the already mentioned "don't write your own XML parser". In fairness, many smaller XML libraries in the wild also have limitations like these.
The expectation points. You need to decide what you want to happen. Do you want to parse only valid XML? Then you need strict expectations, as XML requires element tags as soon as < is encountered outside text/string context.
However this is at odds with parsing incomplete chunks of input. Maybe you can optionally "swallow" all expectations.
Alternatively you could selectively deal with them when at end of input.
Putting Together A Demo
Live On Coliru
Prints
And, if enabled, the debug output:
PS
If you #define INIT %= you will see the problem with exposing std::string() on each string: