4 July 2007

XML: Parser

Creating an XML parser that works like StaX out of a FSM that can process partial chunks of the document as it asynchronously arrives is kinda hard. Especially when you are trying to do it without any heap activity..

Paul at 3:47 pm



4 Comments

  1. Yes. But definitely doable.
    Now, assuming I understand you right, one main benefit would be the ability to do non-blocking parsing: that is, either get next event, or indication of “not enough data yet to know”. For what it’s worth, I actually have written core of such a parser, and one that does implement Stax API too. At this point the main challenge is figuring out who would care enough to want to get it ready for real use. ;-)

    There are already enough good-but-blocking parsers for this to be bit of a niche thing I suppose.
    Oh, also, I didn’t use FSM, but that’s mostly since it gets rather hairy when one wants to also integrate utf-decoding with parsing, for further performance gains.

    Comment by Cowtowncoder — July 19, 2007 @ 8:29 am

  2. I’m glad you have done it, and thanks for the encouragement :) Woodstox is the parser mine looks up to..

    I hope the FSM thing works out for me, since its somewhere around 400 states already (I could reduce that by making it a pushdown automata..), utf characters get decoded to Unicode only as they are needed. I plan on using what I call a “Tim Bray technique” where chars > 128 get transformed to corresponding ASCII characters for the sake of keeping the transition characters in the FSM low, but inside the state the original Unicode character is checked for its validity to be present in whatever part of the document we are looking at. Perhaps if profiling shows to much work is being done, I could make it happen over longer runs of characters.

    Yes, I want it to be a niche thing (and it should be just because it is written for the D programming language), and you did read my intentions right. I want the parser to be suitable for environments like jabber servers or handling thousands of concurrent REST requests etc. I’m not sure anyone cares to do this at such a low level, so I may end up building some recursive descent parser generator that fits with an asynchronous IO framework..

    Hopefully someone wants to pay me somehow for the end usage to compensate for being the only person in the world who cares :)

    Comment by Paul — July 20, 2007 @ 2:05 am

  3. [...] combination of these XPath expressions get applied like a giant state-machine/trie to the incoming XML events in something approximating log(n) time, so large numbers of XML streams can be handled [...]

    Pingback by Paul Findlay » XML: Impossible to be original — July 20, 2007 @ 10:46 am

  4. Hey Paul, good luck with the development! Like I said, it’s a non-trivial task. But I think using a state machine is a good idea (in case it wasn’t obvious from my first comment), and it was something I was thinking about too. In the end choice had more to do with possibility to recycle some existing pieces, than feasibility of approaches.
    I hope to learn more about the implementation if and when you get that far.

    Comment by Cowtowncoder — October 13, 2007 @ 11:41 am

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.