How not to write a parser

So if you’ve got two JSON objects, one that’s serialized as

{"Value": []}

and the other as

{
  "Value": [
  ]
}

is there any difference between them?

According to my web browser, which is basically the final arbiter when it comes to what is and is not valid JavaScript, no.  But according to Delphi’s DBXJSON library, there’s a big difference: the first form deserializes into exactly what you’d expect, but the second form is rejected as invalid.

I ran across this on StackOverflow a few days ago.  It turns out that the JSON parser’s whitespace handling is wrong when it sees a [ character.  The code is kind of complicated, but the basic idea is this: if the character immediately after the [ is not a ], the parser skips any whitespace and then expects to see a valid JSON value.  So if the first character after that whitespace is a ], instead of recognizing an empty array, it errors out.

You could look at this and say “oh, yeah, that’s a bug, they need to check for whitespace before checking for the ]”.  But you’d be wrong.  The problem goes deeper than that; the problem is that the possibility for an error like this exists at all.  The DBXJSON parser is a shining example of how not to write a parser, particularly a parser for any non-trivial grammar, such as JSON: if your parser is looking at the characters in the input, you’re doing it wrong.

A parser should not care about characters; it should care about tokens.  A token is a simple record or object that tells the parser some basic information about a character or group of characters that makes up a grammatical element, such as what type of element it is, what it contains, and (if you care about error reporting) the token’s position in the input file.

But the tokens don’t come from the parser; they come from a lexer, a separate object whose job it is to read the input and turn it from a big string into a list of tokens.  The lexer has built-in rules about where a token starts and where it ends, and (in a language where whitespace is not significant, at least) it will automatically skip whitespace after the end of every token.
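To make the idea concrete, here is a rough sketch of what such a token and lexer might look like in Delphi.  This is my own illustration, not DBXJSON’s actual code: the names (TToken, TJsonLexer, and so on) are hypothetical, and real string/number scanning is elided.

    type
      TTokenType = (ttOpenBrace, ttCloseBrace, ttOpenBracket, ttCloseBracket,
        ttColon, ttComma, ttString, ttNumber, ttKeyword, ttEOF);

      TToken = record
        TokenType: TTokenType;
        Value: string;     // raw text, for strings, numbers and keywords
        Position: Integer; // offset in the input, for error reporting
      end;

      TJsonLexer = class
      private
        FInput: string;
        FPos: Integer;
        procedure SkipWhitespace;
      public
        constructor Create(const AInput: string);
        function NextToken: TToken;
      end;

    constructor TJsonLexer.Create(const AInput: string);
    begin
      FInput := AInput;
      FPos := 1;
    end;

    procedure TJsonLexer.SkipWhitespace;
    begin
      while (FPos <= Length(FInput)) and
            CharInSet(FInput[FPos], [' ', #9, #10, #13]) do
        Inc(FPos);
    end;

    function TJsonLexer.NextToken: TToken;
    begin
      SkipWhitespace; // whitespace is handled in exactly one place
      Result.Position := FPos;
      if FPos > Length(FInput) then
      begin
        Result.TokenType := ttEOF;
        Exit;
      end;
      case FInput[FPos] of
        '[': begin Result.TokenType := ttOpenBracket; Inc(FPos); end;
        ']': begin Result.TokenType := ttCloseBracket; Inc(FPos); end;
        // ... and so on for '{', '}', ':', ',', strings, numbers, true/false/null
      end;
    end;

Because SkipWhitespace is called in exactly one place, at the start of NextToken, no individual parse rule can forget to call it, and the bug above simply has nowhere to live.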

If the DBXJSON parser had a proper lexer, this problem wouldn’t exist.  It couldn’t exist.  The logic would go like this (pseudocode):

If CurrentToken.TokenType = OpenBracket then
begin
   GetToken;
   If CurrentToken.TokenType = CloseBracket then
      MakeEmptyArray
   else
      ParseArrayContents;
end;

Have a look at the actual parsing code in TJSONObject and just think about how much cleaner and easier to read it would be if all the string-reading code were off in its own class and the parser concerned itself only with determining the meaning of the sequence of tokens.

So if you ever need to write a real parser for some task too complicated to handle with a TStringList, remember: use a lexer! It’s a little bit more work right at the start, but it will make your task much simpler overall.

7 Comments

  1. Marco Cantu says:

    Yes, the Delphi JSON parser has a few issues, but at least some of them have been fixed, like the initial lack of support for Unicode and the errors showing up for any space or whitespace. It is true that the internal structure isn’t the best around! Well, I told you this in person earlier today, but I wanted to comment anyway and thank you for the comments, feedback, suggestions, and chats during the past 4 days.

    See you soon.

    -Marco

    • Sam says:

      Marco, this SO issue was one I submitted and am thankful for the help I received there to get past this Delphi bug. I just want to take a second to thank you for your Delphi 2010 whitepaper on consuming REST services in Delphi. It has been extremely helpful. Most of our organization’s services that I consume have been converted from SOAP to REST. Your whitepaper was invaluable in helping me make the transition to consuming the new services.

      For any other Delphi developer needing a primer on using REST services in Delphi, here’s a link to Marco’s blog page: http://blog.marcocantu.com/blog/rest_delphi_2010_whitepaper_published.html

  2. Eric says:

    JSON is not actually very complex: a JSON tokenizer will already have done almost all the work required for parsing, leaving only the creation of a DOM or AST (or callbacks for a sequential-access parser). So in such a case you can merge the lexer and the parser; otherwise you’re going to introduce inefficiency (not that DBXJSON is very efficient…).

    FWIW, if you need a fast and tested JSON parser, the dwsJSON unit in DWScript can be used “standalone”. It doesn’t suffer from the bug you mentioned, is much faster than DBXJSON (about 20x), faster than SuperObject (about 2x), can handle complex JSON and supports full Unicode.

    All in all, DBXJSON is probably the worst JSON implementation in Delphi that I’ve seen.

    • deksden says:

      I’m sure I’ll go the DWS way if I get to choose the technology or portions of code! Thanks for the info about the parser! Maybe you could post it as a child project on GitHub, and use it in DWS as an external dependency? Separating this code could make it more popular, and the dws “brand” in its name will pay you back))

      • Eric says:

        I’m not too sure about the “discoverability” of stuff in github: usually github projects can only be found if referenced in some “regular” site or blog.

        I’ll try to post an article about it though.

  3. Nicholas Ring says:

    Good topic for a new blog: The process of writing a parser – In Delphi, of course 😀

  4. Nice read! If I look over the JSON parser I wrote for my MongoDB connector:
    https://github.com/stijnsanders/TMongoWire/blob/master/bsonUtils.pas
    It looks like I’m doing ‘lexing’ with inline procedures like SkipWhitespace, Expect and GetStringValue, and the big case at the center of the big while loop!
