FrankTheDevop

Script in Node.js to iterate a directory and extract information from its files

Hi everyone,

After building the template last time, I want to show you how to put the individual pieces together.
Based on a task I had at hand, I chose the example of iterating over a directory and working with its files.
The exact task was:
– Iterate a directory
– find all JSON files in it
– read them
– extract all objects in them
– extract the property email from them
– extract the unique domains of the email addresses
– count how often each domain occurs
– write this information to a summary file for further processing / display

'use strict'

const Promise = require('bluebird')
const fs = require('fs');
const path = require('path');
const util = require('util');

// Promisify only readdir and writeFile as we don't need more
const readdirAsync = Promise.promisify(fs.readdir);
const writeFileAsync = Promise.promisify(fs.writeFile);

// Commandline handling
const optionDefinitions = [
  { name: 'folder', alias: 'f', type: String }
]
const commandLineArgs = require('command-line-args')
const options = commandLineArgs(optionDefinitions)

// Add the path to your files
const folder = options.folder
// e.g. '/Users/$Yourusername/Downloads/customerdata';

// This will hold all entries from all files
//  Not unique
const all = []

// Read all files in our directory
readdirAsync(folder)
  .then(files => {
    files
      .map(entry => {
        if(entry.indexOf('.') > 0 && path.extname(entry) === '.json') {
          // In case there are .json files in the folder that are not in JSON format
          try {
            const temp = require(path.resolve(folder, entry))

            // I know for sure that all entries have a filled email property so I can just split here
            // and extract the domain name without checking
            temp.map(entry => (all.push(entry.email.split('@')[1])))
          } catch (e) {}
          return null
        }
      })

    console.log(all)
  })
  .then(() => {
    // Create a unique array
    // Use the new Set feature of ES 6
    return Promise.resolve([...new Set(all)])
  })
  .then(allUnique => {
    // Get a list of unique entries with the number of times it appears
    let newList = []

    allUnique.map(entry => {
      const t = all.filter(innerEntry => innerEntry === entry)
      newList.push({name: entry, count: t.length})
    })

    return Promise.resolve(newList)
  })
  .then(allUnique => {
    allUnique.sort((a,b) => b.count - a.count)
    return Promise.resolve(allUnique)
  })
  .then(allUnique => {
    let content = allUnique.reduce((a, b) => a + `${b.name};${b.count}\n`, '')
    return writeFileAsync(path.join(folder, 'all.txt'), content)
  })
  .then(data => {
    console.log('Wrote file successfully')
  })
  .catch(err => {
    console.log('An error occurred:', err)
  })

You can find the repository for it here (see source 9).

If you are looking for the explanation, continue to read. Otherwise be happy with the template and change it to your heart's desire ;).

This one is a bit longer, but stay with me, we will go through it together.
First we have the standard block where we import all required libraries in Lines 3-6.

Then we convert the async, callback-based functions readdir and writeFile to Promises (we promisify them) for easier
and more elegant handling in Lines 9-10.
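
For comparison, here is a minimal sketch of the same step with Node's built-in util.promisify (available since Node.js 8) instead of bluebird. This is an alternative, not what the script above does, and reading process.cwd() is just a stand-in for your own folder.

```javascript
const fs = require('fs')
const util = require('util')

// Built-in alternative to bluebird's Promise.promisify.
// The wrapped functions return a native Promise instead of taking a callback.
const readdirAsync = util.promisify(fs.readdir)
const writeFileAsync = util.promisify(fs.writeFile)

// Usage: list the entries of the current working directory
const pending = readdirAsync(process.cwd())
pending
  .then(files => console.log(`${files.length} entries found`))
  .catch(err => console.log('An error occurred:', err))
```

The rest of the promise chain stays the same either way, because both variants return something you can call .then and .catch on.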

Next comes the handling of command line (CLI) parameters in Lines 12-21, as we did before.
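
The script does not check whether a folder was actually passed; without it, options.folder would be undefined. The following sketch shows one way to guard against that. To keep it free of dependencies it reads process.argv by hand instead of using command-line-args, and parseFolder is a hypothetical helper name of my own, not part of any library.

```javascript
// Hypothetical helper: returns the value following --folder or -f,
// or undefined if the flag is missing or has no value after it.
function parseFolder (argv) {
  const index = argv.findIndex(arg => arg === '--folder' || arg === '-f')
  if (index === -1 || index === argv.length - 1) return undefined
  return argv[index + 1]
}

// process.argv starts with the node binary and the script path, so skip both
const folder = parseFolder(process.argv.slice(2))
if (!folder) {
  console.log('Usage: node script.js --folder <path>')
}
```

With command-line-args the same guard is simply a check on options.folder before calling readdirAsync.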

We define an array all, which will receive all email domains from the files we read (not unique), in Line 25.

Now we have everything together to start:
We read the directory content in Line 28.
In Line 29 it returns an array of all found entries.
With Lines 30-31 we start iterating over all elements of the array, and with Line 32 we make sure that only files ending with .json are accepted; all others are ignored.
Lines 34-41 are a bit of cheating: Node.js is able to require a JSON file. So instead of reading the file, parsing it and having to handle it all myself, I use the functionality of require.
In case there is a JSON file that cannot be parsed, I wrap the call in a try/catch block, so that the script continues after an error.
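
If you prefer not to cheat, the explicit route is to read the file and parse it yourself. Below is a minimal sketch of that, assuming the files are UTF-8 encoded; loadJson is a hypothetical helper name and the two files are throwaway examples created in a temporary directory.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: returns the parsed content, or null if the file
// cannot be read or does not contain valid JSON.
function loadJson (file) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'))
  } catch (e) {
    return null
  }
}

// Usage with two throwaway files in a temporary directory
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'customerdata-'))
fs.writeFileSync(path.join(dir, 'good.json'), '[{"email":"jane@example.com"}]')
fs.writeFileSync(path.join(dir, 'bad.json'), 'not json at all')

console.log(loadJson(path.join(dir, 'good.json'))) // parsed array with one object
console.log(loadJson(path.join(dir, 'bad.json'))) // null
```

One difference worth knowing: require caches what it loads, so a file that changes while the process runs is not read again, whereas the helper above reads from disk on every call.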
Line 39 does a few things at once:
– With .map I iterate over all entries in the file
– I know each object contains an email property, therefore I act on it without checking
– An email address has the form username@domainname.domainextension. I need the domain name and extension, so I split the email property at the @ and take the second part, which is the domain
– Each of these domain parts is pushed into the array all for further processing
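
If your data is less clean than mine, the extraction needs a few checks. Here is a small sketch under that assumption; extractDomain is a hypothetical helper, and lower-casing the domain is my own addition so that Example.com and example.com are counted together.

```javascript
// Hypothetical helper: returns the lower-cased domain of an email address,
// or null if the value is not a string with something after an "@".
function extractDomain (email) {
  if (typeof email !== 'string') return null
  const at = email.lastIndexOf('@')
  if (at === -1 || at === email.length - 1) return null
  return email.slice(at + 1).toLowerCase()
}

const records = [
  { email: 'jane@example.com' },
  { email: 'john@Example.com' },
  { email: 'no-at-sign' },
  { name: 'record without email' }
]

// Keep only the records that yielded a domain
const domains = records
  .map(record => extractDomain(record.email))
  .filter(domain => domain !== null)

console.log(domains) // [ 'example.com', 'example.com' ]
```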

After all the processing I print the collected domains as debug output in Line 45.

JavaScript ES 6 introduced some nice new features; two of them are the Set (a collection of unique values) and the spread syntax (...). In Line 50 I return a new array that is created by spreading the Set,
so in short: in one line I get a unique array of domains.
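
Here is that one-liner in isolation, with a made-up list of domains:

```javascript
const all = ['gmail.com', 'example.com', 'gmail.com', 'gmail.com']

// A Set keeps each value once, in insertion order;
// the ... syntax turns it back into a regular array.
const allUnique = [...new Set(all)]

console.log(allUnique) // [ 'gmail.com', 'example.com' ]
```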

In the next function we create a new array of objects with the domain name and the number of occurrences. For that we iterate over each entry of the unique array in Line 56 and
use the filter method of the non-unique array with all entries in Line 57. The filter method returns an array, so I can create the object with the number of occurrences easily by using array.length in Line 58.
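
Filtering the full array once per unique domain is fine for a handful of files, but it walks the whole list again for every domain. As a sketch of an alternative (not what the script does), a Map can count everything in a single pass and also makes the separate Set step unnecessary. The sample data is made up.

```javascript
const all = ['gmail.com', 'example.com', 'gmail.com', 'gmail.com']

// Count in a single pass instead of filtering the full array per domain
const counts = new Map()
for (const domain of all) {
  counts.set(domain, (counts.get(domain) || 0) + 1)
}

// Same shape as newList in the script: [{ name, count }, ...]
const newList = [...counts].map(([name, count]) => ({ name, count }))

console.log(newList)
```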

After I have the array with the number of occurrences, I want to see it sorted. The sort function allows us to provide a function that defines how to sort. Thanks to (5) & (6) I found a short way to do it, as you see
in Line 64.
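
To see why b.count - a.count sorts from highest to lowest, here is a small sketch with made-up data. The tie-breaker on the name is my own addition and not part of the script.

```javascript
const list = [
  { name: 'example.com', count: 1 },
  { name: 'gmail.com', count: 3 },
  { name: 'aol.com', count: 1 }
]

// A negative result puts a first, a positive result puts b first,
// so b.count - a.count orders the entries from highest to lowest count.
// For equal counts the difference is 0 and the name decides.
list.sort((a, b) => b.count - a.count || a.name.localeCompare(b.name))

console.log(list.map(entry => entry.name)) // [ 'gmail.com', 'aol.com', 'example.com' ]
```

Keep in mind that sort changes the array in place, so list itself is reordered here.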

In the last function I use the array.reduce function to create a string from the objects, one line per domain in the form name;count. You can see this post-processing step in Line 68.
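
The reduce call can look a bit cryptic at first, so here it is on made-up data, next to an equivalent map and join version that some people find easier to read:

```javascript
const sorted = [
  { name: 'gmail.com', count: 3 },
  { name: 'example.com', count: 1 }
]

// reduce starts with '' and appends one "name;count" line per entry
const viaReduce = sorted.reduce((a, b) => a + `${b.name};${b.count}\n`, '')

// Equivalent: build the lines first, then join them
const viaJoin = sorted.map(entry => `${entry.name};${entry.count}\n`).join('')

console.log(viaReduce === viaJoin) // true
```

Both produce the same semicolon-separated text, so which one you use is a matter of taste.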

All that is left is to write the data to a file as you see in Line 69.

This is followed by a simple message to signal that the script has successfully finished (Line 72) or the output of the error if one occurred in Line 75.

I hope I could help you save time again in your race against the clock and you found the explanations useful.

Yours sincerely,
Frank

Sources:
(1) How to escape Callback Hell
(2) Explanation of Node.js CLI Argument handling
(3) Explanation of Node.js CLI Argument handling II
(4) My own short example of a template for Node.js CLI Argument handling
(5) Sorting an array
(6) Sorting an array of objects by their property
(7) How to write to a file in Node.js
(8) How to avoid making mistakes with Promises
(9) Repository for the script
(10) Escape Callback Hell with Promises
(11) My article about how to convert (promisify) an async function with callback to a Promise based one