Monday, December 21, 2015

Learning About Large JSON Files

When connecting to different APIs you might come across large datasets that are too big to open comfortably in a normal text editor.  This makes it difficult to learn how the data is structured.

There exists a really cool tool called jq that makes it a little easier to see the data structure.  It takes a while to learn the syntax, but it is pretty powerful.  Below are a few of the commands I use regularly to learn how my data is structured.

Assume we have a file named largeResponse.json that has a structure similar to this:
{
  "data_available": true,
  "query": "The query is here",
  "results": {
    "details": [
      {
        "id": 1,
        "data1": "data1",
        "data2": "data2"
      },
      {
        "id": 2,
        "data1": "data1",
        "data2": "data2"
      }
      ... Many more rows here
    ]
  }
}

Getting the keys

 jq 'keys' largeResponse.json
This command will list all of the keys of the root object.  We can also see the keys of child objects using
jq '.results | keys' largeResponse.json
Assuming there is a child object of 'results'  this will show the list of the keys in the results object.
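
Knowing the keys is a start, but it often helps to know what kind of value sits behind each one.  Below is a minimal sketch of that idea.  The file sample.json is a hypothetical, cut-down stand-in for largeResponse.json, and map_values assumes jq 1.5 or newer.

```shell
# Build a tiny stand-in for largeResponse.json (hypothetical sample data).
cat > sample.json <<'EOF'
{
  "data_available": true,
  "query": "The query is here",
  "results": {
    "details": [
      {"id": 1, "data1": "data1", "data2": "data2"},
      {"id": 2, "data1": "data1", "data2": "data2"}
    ]
  }
}
EOF

# List the root keys (-c prints compact, single-line output).
jq -c 'keys' sample.json
# ["data_available","query","results"]

# Replace each root value with the name of its JSON type.
jq -c 'map_values(type)' sample.json
# {"data_available":"boolean","query":"string","results":"object"}
```

This tells you which keys are small scalars that are safe to print and which ones are objects or arrays worth drilling into.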

Get an object

jq '.results' largeResponse.json 
will show all of the results object, which in this case would be a LOT of information.  However, for smaller values, like the query, it could be very helpful.


Working with Arrays

With a large list of data it is useful to see a sample of the data.

Length. We can see the length of an array with something like this
jq '.results.details | length' largeResponse.json
Notice how details is an array.

Get a single Object.  We can get an item from an array with the index.
jq '.results.details[0]' largeResponse.json
This would return 

{
  "id": 1,
  "data1": "data1",
  "data2": "data2"
}
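
One item does not always tell the whole story, because items in the same array can carry different keys.  The sketch below is one way to check for that.  The file details.json and its contents are made up for illustration, with the second item deliberately missing data2.

```shell
# Hypothetical sample where the two items do not share the same keys.
cat > details.json <<'EOF'
{"results": {"details": [
  {"id": 1, "data1": "data1", "data2": "data2"},
  {"id": 2, "data1": "data1"}
]}}
EOF

# Keys of the first item only.
jq -c '.results.details[0] | keys' details.json
# ["data1","data2","id"]

# Every distinct set of keys found across all items.
jq -c '[.results.details[] | keys] | unique' details.json
# [["data1","data2","id"],["data1","id"]]
```

If the second command returns a single key list, every item has the same shape.  Note that it still has to read the whole array to find out.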

Get a range of objects.  jq can also get a range in an array.
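jq '.results.details[0:10]' largeResponse.json
This would return the first ten objects in the details array (indexes 0 through 9).  Note the colon: with a comma, [0,10] would return only the objects at index 0 and index 10.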
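
The bracket syntax is easy to mix up, so here is a small sketch comparing a slice (colon) with a list of indexes (comma).  The file sample_array.json is a hypothetical four-item example, small enough to check the output by eye.

```shell
# Hypothetical four-item array for trying out index syntax.
cat > sample_array.json <<'EOF'
{"results": {"details": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]}}
EOF

# Slice: items at index 0 and 1 (the end index is exclusive), returned as one array.
jq -c '.results.details[0:2]' sample_array.json
# [{"id":1},{"id":2}]

# Comma: only the items at index 0 and index 2, returned as separate outputs.
jq -c '.results.details[0,2]' sample_array.json
# {"id":1}
# {"id":3}

# Negative start index: the last two items.
jq -c '.results.details[-2:]' sample_array.json
# [{"id":3},{"id":4}]
```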

Tons more

And of course you can do a million other things that I haven't come close to exploring.  To see all that you can do, take a look at the jq manual.

Misc

To print out multiple fields:
jq '.users[] | .first + " " + .last'
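
To try that command out, here is a minimal sketch with a made-up users.json file.  The names in it are placeholders, and the -r flag is an addition that prints the strings without JSON quotes.

```shell
# Hypothetical input with the .users, .first and .last fields the command expects.
cat > users.json <<'EOF'
{"users": [{"first": "Ada", "last": "Lovelace"}, {"first": "Alan", "last": "Turing"}]}
EOF

# Concatenate two fields per user; -r prints raw strings instead of quoted JSON.
jq -r '.users[] | .first + " " + .last' users.json
# Ada Lovelace
# Alan Turing

# The same result using string interpolation.
jq -r '.users[] | "\(.first) \(.last)"' users.json
# Ada Lovelace
# Alan Turing
```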

1 comment:

  1. Nice post on parsing JSON. The only thing I dislike about JSON is that in order to determine the schema, the entire file has to be parsed because it can change per object. This is the issue I've been running into recently on massive JSON files I'm parsing. The same issue exists with any open format language like XML.
