Parsing individual data items from huge JSON streams in Node.js

Let's say you have a huge amount of JSON data and you want to parse values from it in Node.js. Perhaps it's stored in a file on disk or, more trickily, it's on a remote machine and you don't want to download the entire thing just to get some data from it. And even if it is on the local file system, the thing is so huge that reading it into memory and calling JSON.parse will crash the process with an out-of-memory error. Today I implemented a new method for my async JSON-parsing lib, BFJ, which has exactly this type of scenario in mind.

BFJ already had a bunch of methods for parsing and serialising large amounts of JSON en masse, so I won't go into those here. The readme is a good place to start if you want to read more. Instead, this post is going to focus on the new method, match, which is concerned with picking individual records from a larger set.

match takes 3 arguments:

  1. A readable stream containing the JSON input.

  2. A selector argument, used to identify matches from the stream. This can be a string, a regular expression or a predicate function. Strings and regular expressions are used to match against property keys. Predicate functions are called for each item in the data and passed two arguments, key and value. Whenever the predicate returns true, that value will be pushed to the stream.

  3. An optional options object.

It returns a readable, object-mode stream that will receive the matched items.

Enough chit-chat, let's see some example code!

const bfj = require('bfj');
const fs = require('fs');

// Stream user objects from a file on disk
bfj.match(fs.createReadStream(path), 'user')
  .pipe(createUserStream());

// Stream all the odd-numbered items from an array
bfj.match(fs.createReadStream(path), /[13579]$/)
  .pipe(createOddIndexStream());

// Stream everything that looks like an email address from some remote resource
const request = require('request');
bfj.match(request(url), (key, value) => emailAddressRegex.test(value))
  .pipe(createEmailAddressStream());

Those examples do not try to load all of the data into memory in one hit. Instead they parse the data sequentially, pushing a value to the returned stream whenever they find a match. The parse also happens asynchronously, yielding at regular intervals so as not to monopolise the event loop.
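Because the returned stream is a regular object-mode readable, you don't have to pipe it anywhere; on Node 10 or later you can also consume it with for await...of. Here's a minimal sketch along those lines; users.json and the 'user' selector are just placeholders:

const bfj = require('bfj');
const fs = require('fs');

async function collectUsers(path) {
  const users = [];

  // bfj.match returns an object-mode readable stream,
  // so matched values can be consumed with async iteration
  for await (const user of bfj.match(fs.createReadStream(path), 'user')) {
    users.push(user);
  }

  return users;
}

collectUsers('./users.json')
  .then(users => console.log(`found ${users.length} users`))
  .catch(error => console.error(error));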

The approach can be used to parse items from multiple JSON objects in a single source, too, by setting the ndjson option to true. For example, say you have a log file containing structured JSON data logged by Bunyan or Winston. Specifying ndjson will cause BFJ to treat newline characters as delimiters, allowing you to pull out interesting values from each line in the log:

// Stream uids from a logfile
bfj.match(fs.createReadStream(logpath), 'uid', { ndjson: true })
  .pipe(createUidStream());
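To make that concrete, the input in this case is just newline-delimited JSON, one object per line. The log lines below are entirely made up, but given something shaped like them, the two uid values are what would be pushed downstream:

{"name":"api","level":30,"uid":"4dd2a4","msg":"session created","time":"2019-03-01T10:00:00.000Z"}
{"name":"api","level":30,"uid":"9f81bc","msg":"session destroyed","time":"2019-03-01T10:05:00.000Z"}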

If you need to handle errors from the stream, you can do that by attaching event handlers:

const outstream = bfj.match(instream, selector);
outstream.on('data', value => {
  // A matching value was found
});
outstream.on('dataError', error => {
  // A syntax error was found in the JSON data
});
outstream.on('error', error => {
  // An operational error occurred
});
outstream.on('end', () => {
  // The end of the stream was reached
});
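If you'd rather work with a promise, those events are easy to wrap yourself. This is just a sketch, not part of BFJ's API, and collectMatches is a made-up helper name:

const bfj = require('bfj');

function collectMatches(instream, selector, options) {
  return new Promise((resolve, reject) => {
    const matches = [];
    const outstream = bfj.match(instream, selector, options);

    // Accumulate matched values, then settle the promise
    // when the stream ends or fails
    outstream.on('data', value => matches.push(value));
    outstream.on('dataError', reject);
    outstream.on('error', reject);
    outstream.on('end', () => resolve(matches));
  });
}

Bear in mind that collecting every match into an array reintroduces the memory problem this post started with, so a wrapper like this only makes sense when the matched subset is small.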

There's lots more information in the readme so, if any of this sounds interesting, I encourage you to take a look!