Parsing JSON

Tuesday, July 2, 2019

Hello friends,

Last time I talked with you about parsing INI files. Now let’s parse some JSON.

There is no good reason to do this in JavaScript since it’s built in and much faster. Well there is one good reason. Parsers are super fun. ANYWAY let’s dive in.

Lexer

First thing we want to do is take the file string and turn it into an array of tokens.

//Make an enum it's just easier to type this way.
const _types = [
  "LEFT_BRACKET",
  "RIGHT_BRACKET",
  "LEFT_BRACE",
  "RIGHT_BRACE",
  "TRUE",
  "FALSE",
  "NULL",
  "COMMA",
  "STRING",
  "COLON",
  "QUOTE"
];

const TYPES = {};
_types.forEach((type, index)=>{
	TYPES[type] = index;
});
function getString(text, index){
  let str = "";
  let len = 0;
  for(let i = index + 1; i<text.length; i++){
  	if(text[i] === `
"
`){
    	return [str, len];
    } else if(text[i] === "\\" && text[i+1] === `
"
`){
    	str += `
"
`;
      len +=2;
      i++;
    } else{
    	len++;
     	str += text[i];
    }
  }
  
  return null;
}


function lex(text) {
  let index = 0;
  const result = [];
  while(index < text.length){
  	const at = text[index];
  	switch(at){
    case "{":
    	result.push({type: TYPES.LEFT_BRACE});
      index++;
      break;
    case "}":
    	result.push({type: TYPES.RIGHT_BRACE});
      index++
      break;
    case ":":
    	result.push({type: TYPES.COLON});
      index++;
      break;
    case ",":
    	result.push({type: TYPES.COMMA});
      index++;
      break;
    case "n":
    	if(text.substring(index, index+4) === "null"){
      	index+=4
        result.push({type: TYPES.NULL});
      } else {
      	throw new Error("unexpected character parsing null");
      } 
      break;
    case "f":
   	  if(text.substring(index, index+5) === "false"){
      	index+=5
        result.push({type: TYPES.FALSE});
      } else {
      	throw new Error("unexpected character parsing false");
      } 
      break;
    case "t":
   		if(text.substring(index, index+4) === "true"){
      	index+=4
        result.push({type: TYPES.TRUE});
      } else {
      	console.log(text.substring(index, index+4));
      	throw new Error("unexpected character parsing true");
      } 
      break;
    case "\"":
    	const str = getString(text, index);
    	if(str === null){
      	throw new Error("Expected end of string");
      }
    	result.push({type: TYPES.STRING, value: str[0]});
      index += str[1] + 2;
      break;
    case "[":
    	result.push({type: TYPES.LEFT_BRACKET});
      index++
      break;
    case "]":
    	result.push({type: TYPES.RIGHT_BRACKET});
      index++
      break;
    default: 
      index++ 
    }
  }
  return result
}

So it’s pretty simple. We are going over the string and when we encounter certain characters we turn that into a token and push it into an array. We also validate strings and true, false, null

Tokenizing the string makes it REALLY simple to write the rest of the parser.

SIDE NOTE: I just realized this parser doesn’t handle numbers. I’m going to claim I left that as an exercise for the reader. SO do that.

You’ll also see we handle the case of escaping quotes. We could escape several other things. But that’s way too much work. That’s also an exercise for the reader…

Parsing

Now we get to have some fun parsing the tokens we have.

function parse(tokens, index){
	const t = tokens[index];
 
  switch(t.type){
  	case TYPES.STRING:
    	return [t.value, index+1];
    case TYPES.FALSE:
    	return [false, index+1];
    case TYPES.TRUE:
    	return [true, index+1];
    case TYPES.NULL:
    	return [null, index+1];
    case TYPES.LEFT_BRACKET:
    	return parse_array(tokens, index+1);
    case TYPES.LEFT_BRACE:
    	return parse_obj(tokens, index+1);
    default:
     console.log(index, t, tokens)
    	throw new Error("unexpected type " + Object.keys(TYPES)[t.type] + " " + t.type); 
  }
  
}

We don’t loop over anything here. Because we will call ourselves recursively in parse_array and parse_obj .

Notice we return an array [value, nextIndex] because we want the value we parsed as well as the next index in the tokens array to look at next.

Parsing an array

function parse_array(tokens, index){
	const result = [];
  
  let idx = index;
  
  while(true){
  	const token = tokens[idx];
    if(token.type === TYPES.RIGHT_BRACKET){
    	return [result, idx+1]
    }
    
    const [value, newIdx] = parse(tokens, idx);
    
    result.push(value);
    idx = newIdx;
    
    if(tokens[idx].type === TYPES.RIGHT_BRACKET){
    	return [result, idx+1]
    }
    
    if(tokens[idx].type !== TYPES.COMMA){
    	throw new Error("Expected a comma");
    }
    idx++
  }
  
  return [result, idx]
}

So if we have an empty array the next token will be TYPES.RIGHT_BRACKET and we can return an empty array.

Else we are going to go through the tokens list until we hit a right bracket. Or an error. Calling parse will give us back the value from the next token in the list. This is what is the real magic. If we have a nested array we will recursively call our own function again.

We have error checking for a comma between values or closing bracket if it’s the last item.

Okay now how do we parse an object?

Parsing an object

function parse_obj(tokens, index){
	const result = {};
  let idx = index;
  while(true){
  	const token = tokens[idx];
    if(token.type === TYPES.RIGHT_BRACE){
    	return [result, idx+1];
    }
    
    if(token.type === TYPES.STRING){
    	if(tokens[idx+1].type !== TYPES.COLON){
      	throw new Error("Expected colon after key");
      }
      const [value, newIdx] = parse(tokens, idx+2);
     	
      result[token.value] = value;
      idx = newIdx;
      if(tokens[idx].type === TYPES.RIGHT_BRACE){
      	return [result, idx+1];
      }
      if(tokens[idx].type !== TYPES.COMMA){
      	throw new Error("Expected a comma"); 
      }
    } else {
    	throw new Error("expected key");
    }
    idx++;
  }
  
  return [result, idx];
}

As you can see it’s really similar to an array. But we need a key value pair. We also check for there to be a colon : between key and value. ( "key":"value" ). Everything else is about the same as the array.

Wrapping it up.

Okay so what does parse return at the very end?

It will be an array with the first value being the object we parsed from JSON and the second should be the last index in the tokens array.

eg [{}, 2]

To call parse we need to pass it the result of our lex

const tokens = lex(inputString);

const value = parse(tokens, 0);

That’s not the most user friendly so let’s just wrap up this mess in a simple function that will just return the parsed object.

function parseJSON(text){
	const tokens = lex(inputString);
	const value = parse(tokens, 0);
	
	return value[0]
}

Easy!

Here is the full source as a JSFiddle: https://jsfiddle.net/pebj04zv/1/

I did have some help with this from an article written for python: http://notes.eatonphil.com/writing-a-simple-json-parser.html

checkout https://dropconfig.com to support me if you want.

This post is also available on DEV.