So if you’ve got two JSON objects, one that’s serialized as
and the other as
is there any difference between them?
I ran across this on Stack Overflow a few days ago. It turns out that the DBXJSON parser’s whitespace handling is incorrect when it sees a [ character. The code is kind of complicated, but the basic idea is that if the character immediately after the [ is not a ], it will skip any whitespace and then expect to see a valid JSON value. So if the first character after that whitespace is a ], it errors out, even though an empty array with whitespace between the brackets is perfectly valid JSON.
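To make the failure mode concrete, here is a minimal sketch of a character-level array parser with the same kind of flaw. This is a hypothetical illustration written for this post, not the actual DBXJSON code; the name parse_array_buggy is made up, and values are limited to non-negative integers to keep it short.

```python
WHITESPACE = " \t\r\n"
DIGITS = "0123456789"


def skip_whitespace(text, pos):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos


def parse_array_buggy(text, pos=0):
    """Hypothetical character-level parser; returns (items, next_position)."""
    if pos >= len(text) or text[pos] != "[":
        raise ValueError("expected '['")
    pos += 1
    # The empty-array check only looks at the character right after '['.
    if pos < len(text) and text[pos] == "]":
        return [], pos + 1
    items = []
    while True:
        pos = skip_whitespace(text, pos)
        # Bug: after skipping whitespace we demand a value and never
        # re-check for ']', so "[ ]" is rejected.
        start = pos
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1
        if start == pos:
            raise ValueError(f"expected a value at position {pos}")
        items.append(int(text[start:pos]))
        pos = skip_whitespace(text, pos)
        if pos < len(text) and text[pos] == ",":
            pos += 1
        elif pos < len(text) and text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or ']' at position {pos}")
```

Calling parse_array_buggy("[]") succeeds, while parse_array_buggy("[ ]") raises a ValueError, because the whitespace and the structure of the array are being handled in the same tangle of code.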
You could look at this and say “oh, yeah, that’s a bug, they need to check for whitespace before checking for the ]”. But you’d be wrong. The problem goes deeper than that; the problem is that the possibility of an error like this exists at all. The DBXJSON parser is a shining example of how not to write a parser, particularly a parser for any non-trivial grammar such as JSON. If your parser is looking at the characters in the input, you’re doing it wrong.
A parser should not care about characters; it should care about tokens. A token is a simple record or object that tells the parser some basic information about a character or group of characters that makes up a grammatical element, such as what type of element it is, what it contains, and (if you care about error reporting) the token’s position in the input file.
But the tokens don’t come from the parser; they come from a lexer, a separate object whose job it is to read the input and turn it from a big string into a list of tokens. The lexer has built-in rules about where a token starts and where it ends, and (in a language where whitespace is not significant, at least) it will automatically skip whitespace after the end of every token.
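Here is a small sketch of what such a lexer might look like, assuming a tiny JSON subset of brackets, commas and non-negative integers. The names TokenType, Token and tokenize are my own choices for illustration. Note that this version skips whitespace before each token rather than after it; the effect is the same, since whitespace never reaches the parser.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class TokenType(Enum):
    OPEN_BRACKET = auto()
    CLOSE_BRACKET = auto()
    COMMA = auto()
    NUMBER = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType   # what kind of grammatical element this is
    text: str         # the characters it was built from
    position: int     # offset in the input, for error reporting


PUNCTUATION = {
    "[": TokenType.OPEN_BRACKET,
    "]": TokenType.CLOSE_BRACKET,
    ",": TokenType.COMMA,
}
WHITESPACE = " \t\r\n"
DIGITS = "0123456789"


def tokenize(source: str) -> List[Token]:
    """Turn the input string into a list of tokens, ending with EOF."""
    tokens = []
    pos = 0
    while True:
        # Whitespace is dropped here, in exactly one place.
        while pos < len(source) and source[pos] in WHITESPACE:
            pos += 1
        if pos >= len(source):
            tokens.append(Token(TokenType.EOF, "", pos))
            return tokens
        ch = source[pos]
        if ch in PUNCTUATION:
            tokens.append(Token(PUNCTUATION[ch], ch, pos))
            pos += 1
        elif ch in DIGITS:
            start = pos
            while pos < len(source) and source[pos] in DIGITS:
                pos += 1
            tokens.append(Token(TokenType.NUMBER, source[start:pos], start))
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
```

With this in place, tokenize("[]") and tokenize("[ ]") produce the same sequence of token types: an open bracket, a close bracket, and the end-of-input marker. Only the recorded positions differ.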
If the DBXJSON parser had a proper lexer, this problem wouldn’t exist. It couldn’t exist. The logic would go like this (pseudocode):
    if CurrentToken.Type = OpenBracket then
    begin
      GetToken;
      if CurrentToken.Type = CloseBracket then
        MakeEmptyArray
      else
        ParseArrayContents;
    end;
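For anyone who wants to see that logic actually run, here is a rough sketch of a token-based parser for a tiny subset of JSON (nested arrays of non-negative integers). Everything in it, including the Parser class, the tuple-based tokens and the regex-driven tokenize helper, is an assumption made for illustration; it is not how DBXJSON is structured.

```python
import re

# One named group per token kind; whitespace is handled separately.
TOKEN_PATTERN = re.compile(
    r"(?P<open>\[)|(?P<close>\])|(?P<comma>,)|(?P<number>[0-9]+)"
)


def tokenize(source):
    """Return a list of (kind, text) pairs, ending with ('eof', '')."""
    tokens = []
    pos = 0
    while True:
        while pos < len(source) and source[pos] in " \t\r\n":
            pos += 1
        if pos == len(source):
            tokens.append(("eof", ""))
            return tokens
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()


class Parser:
    def __init__(self, source):
        self.tokens = tokenize(source)
        self.index = 0

    @property
    def current(self):
        return self.tokens[self.index]

    def get_token(self):
        # Advance to the next token; never move past the EOF marker.
        if self.index < len(self.tokens) - 1:
            self.index += 1

    def parse_value(self):
        kind, text = self.current
        if kind == "open":
            return self.parse_array()
        if kind == "number":
            self.get_token()
            return int(text)
        raise ValueError(f"unexpected token {kind!r}")

    def parse_array(self):
        self.get_token()                      # consume '['
        if self.current[0] == "close":        # empty array, spaces or not
            self.get_token()
            return []
        items = [self.parse_value()]
        while self.current[0] == "comma":
            self.get_token()
            items.append(self.parse_value())
        if self.current[0] != "close":
            raise ValueError("expected ']'")
        self.get_token()                      # consume ']'
        return items


def parse(source):
    parser = Parser(source)
    value = parser.parse_value()
    if parser.current[0] != "eof":
        raise ValueError("unexpected input after the value")
    return value
```

Both parse("[]") and parse("[ ]") return an empty list here. The parser never mentions whitespace at all, so there is no place for a whitespace bug to hide in it.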
Have a look at the actual parsing code in TJSONObject and just think about how much cleaner and easier to read it would be if all the string-reading code was off in its own class and the parser only concerned itself with determining the meaning of the sequence of elements in the string.
So if you ever need to write a real parser for some task too complicated to handle with a TStringList, remember: use a lexer! It’s a little bit more work right at the start, but it will make your task much simpler overall.