Running PhantomJS on Node.js

PhantomJS is a great headless WebKit browser for scraping pages and automating web forms. While building a productivity tool, I wanted to run it from a Node.js server.

After going through some documentation, I realised that the two have "irreconcilable differences". The main reason is the JavaScript engine each one runs on: PhantomJS uses WebKit's JavaScriptCore, while Node.js uses V8. The concept of a JS engine was very new to me, and it took me some time to comprehend it.
Here's the link for more background info.

In the end, the whole process was miserable to debug. It was not obvious whether a command failed because of binPath, module.exports or missing variables. On the bright side, I came away with a much better understanding of child_process in Node.js.

todo: understanding exec Paths in execFile()

child_process in Node.js

To get around the conflict without setting up a separate instance to run PhantomJS, Stack Overflow offers a workaround: launch the binary as a child process.

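var childProcess = require('child_process');
// path_to_exec is a placeholder: the required module is expected to
// export the path of the binary to run (e.g. PhantomJS) as a string.
var cbinPath = require(path_to_exec);

// Array of arguments, e.g. ['[add_phantom_script].js', '20', 'string', ...]
var childArgs = [];

childProcess.execFile(cbinPath, childArgs, function (err, stdout, stderr) {
  // err is set when, for example, the process could not be started
  // or exited with a non-zero code.
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }
  // stderr holds whatever the child wrote to its standard error stream.
  if (stderr) {
    console.error(`std exec error: ${stderr}`);
    return;
  }
  console.log(`stdout: ${stdout}`);
});

// Empty childArgs for reuse. This runs right after the child is launched,
// not after it finishes, because execFile() is asynchronous.
childArgs.length = 0;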

Things to notice here:

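  • err and stderr arrive as separate callback parameters: err describes a failure to run the command, stderr is what the child wrote to its error stream.
  • childArgs is passed as an array, and every element reaches the child as a string. Numbers have to be parsed back by the receiving script, and because no shell is involved, a string containing spaces stays a single argument.
  • Template literals (backticks with ${}) are used inside console.error to interpolate variables.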
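
To see what the child actually receives, here is a small sketch. It is not part of the original workaround: toChildArgs is a hypothetical helper, and the child is simply a second Node.js process (started via process.execPath) that echoes its arguments back as JSON.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: make the conversion to strings explicit.
function toChildArgs(values) {
  return values.map(String);
}

const childArgs = toChildArgs([20, 'two words', true]);

// Inline script for the child: print the arguments it received as JSON.
const echoScript = 'console.log(JSON.stringify(process.argv.slice(1)))';

execFile(process.execPath, ['-e', echoScript, ...childArgs], (err, stdout) => {
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }
  const received = JSON.parse(stdout);
  console.log(received);                      // the values as the child saw them
  console.log(received.map((v) => typeof v)); // the type of each value
});
```

Every entry should come back as a string, and 'two words' should still be one argument rather than two.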

Function
child_process.execFile(file[, args][, options][, callback])
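
As a hedged illustration of the optional parameters in that signature, the sketch below passes an options object with timeout, which is a documented execFile() option. The helper name runWithTimeout and the 200 ms value are made up for the example, and the child is a throwaway Node.js process that would otherwise stay alive for about five seconds.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: run a file with arguments and give up after timeoutMs.
function runWithTimeout(file, args, timeoutMs, callback) {
  return execFile(file, args, { timeout: timeoutMs }, callback);
}

// Child script that keeps its process alive for about five seconds.
const sleepy = 'setTimeout(() => {}, 5000)';

runWithTimeout(process.execPath, ['-e', sleepy], 200, (err, stdout, stderr) => {
  if (err && err.killed) {
    // The timeout expired and Node.js killed the child.
    console.error(`child killed by ${err.signal} after the timeout`);
    return;
  }
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }
  console.log(`stdout: ${stdout}`);
});
```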

exec() as a wrapper of execFile()

execFile() and exec() are quite similar in nature, except that execFile() spawns the command directly without first spawning a shell.

Let's put two commands side by side:

child_process.execFile(file[, args][, options][, callback])
vs
child_process.exec(command[, options][, callback])
where command = 'execphantom.js phantomsample.js | wc -l' (a single space-separated string).

Things to notice here:

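  • exec() does not take a separate args array; the arguments are part of the command string.
  • Because that string is run by a shell, exec() can chain several commands (pipes, redirects), unlike execFile() and spawn(), which by default start a single executable.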
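
A minimal sketch of that difference, assuming a POSIX shell with echo and wc available (it will not behave the same under cmd.exe). The names pipeline and literalArgs are only for illustration.

```javascript
const { exec, execFile } = require('child_process');

// One command string: the shell started by exec() interprets the pipe.
const pipeline = 'echo hello | wc -l';

exec(pipeline, (err, stdout) => {
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }
  console.log(`exec line count: ${stdout.trim()}`);
});

// execFile() starts echo directly, so "|", "wc" and "-l" are just more
// arguments for echo and are printed back verbatim.
const literalArgs = ['hello', '|', 'wc', '-l'];

execFile('echo', literalArgs, (err, stdout) => {
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }
  console.log(`execFile output: ${stdout.trim()}`);
});
```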

todo: compare performance gain of skipping first spawning a shell

spawn() vs exec()

spawn() and exec() differ in how they return stdout: exec() buffers the whole output and hands it to the callback, while spawn() exposes it as a stream.
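
The following sketch puts the two side by side. It only asks a second Node.js process for its version, so the output is tiny; the variable names are illustrative, and the quoting of the command string assumes the path to the node binary contains no double quotes.

```javascript
const { exec, spawn } = require('child_process');

const nodeBin = process.execPath;

// exec(): output is collected in memory and delivered once, in the callback.
exec(`"${nodeBin}" --version`, (err, stdout) => {
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }
  console.log(`exec (buffered): ${stdout.trim()}`);
});

// spawn(): returns a ChildProcess right away; stdout is a readable stream
// that emits chunks while the child is still running.
const child = spawn(nodeBin, ['--version']);
let streamed = '';

child.stdout.on('data', (chunk) => {
  streamed += chunk; // chunk is a Buffer; += converts it to a string
});

child.on('close', (code) => {
  console.log(`spawn (streamed): ${streamed.trim()} (exit code ${code})`);
});
```

For a short output the difference hardly matters; streaming becomes the safer choice when the child produces a lot of data or runs for a long time.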

Things to note:

  • exec() executes the command in a shell, which defaults to /bin/sh on Linux and cmd.exe on Windows.
  • exec() should be used with caution because it is open to shell injection. Whenever possible, use execFile() instead, as invalid arguments passed to execFile() will yield an error (from DZone).
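
To make the injection risk concrete, here is a sketch assuming a POSIX shell with echo available. The value userInput is a made-up hostile input, and the synchronous variants are used only so the two results can be printed in order.

```javascript
const { execSync, execFileSync } = require('child_process');

// Hypothetical untrusted value, e.g. taken from a web form.
const userInput = 'hi; echo INJECTED';

// Shell route: the value is spliced into a command string, so the shell
// treats ";" as a separator and runs the second command as well.
const viaShell = execSync('echo ' + userInput).toString();

// execFile route: the value is handed to echo as one literal argument.
const viaExecFile = execFileSync('echo', [userInput]).toString();

console.log(JSON.stringify(viaShell));    // two lines: the injected echo ran
console.log(JSON.stringify(viaExecFile)); // one line: the input printed as-is
```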

Async vs Sync

Each of the asynchronous child_process methods returns a ChildProcess instance. These objects implement the Node.js EventEmitter API, allowing the parent process to register listener functions that are called when certain events occur during the life cycle of the child process.
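
As a rough sketch of what registering such listeners looks like, the example below spawns a throwaway Node.js process that exits with code 3 and listens for the documented 'error', 'exit' and 'close' events. The seen array is only there to show the order in which they arrive.

```javascript
const { spawn } = require('child_process');

// Throwaway child: a Node.js process that exits with code 3.
const child = spawn(process.execPath, ['-e', 'process.exit(3)']);

const seen = []; // order in which lifecycle events arrive

child.on('error', (err) => {
  // Emitted when, for example, the process could not be spawned.
  console.error(`failed to start child: ${err}`);
});

child.on('exit', (code, signal) => {
  seen.push('exit');
  console.log(`child exited with code ${code}, signal ${signal}`);
});

child.on('close', (code) => {
  // Emitted after the process has ended and its stdio streams are closed.
  seen.push('close');
  console.log(`stdio closed, exit code ${code}; events seen: ${seen.join(', ')}`);
});
```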

Most of the Node.js API follows the idiomatic asynchronous programming pattern, and so do the aforementioned commands.

However, for general-purpose scripts where performance is not a concern, using the synchronous commands reduces the complexity of managing multiple processes.

child_process.spawnSync(), child_process.execSync() and child_process.execFileSync() are the synchronous versions of the other child_process commands.

Consequences:

  • block Node.js event loop
  • pause execution of any additional code until the spawned process exits
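
A small sketch of both consequences: a zero-delay timer is scheduled first, then execFileSync() runs a child that stays alive for about 300 ms. The child script and variable names are made up for the example.

```javascript
const { execFileSync } = require('child_process');

const started = Date.now();
let timerFiredAfterMs = null;

// Scheduled for "as soon as possible", but it needs the event loop to run.
setTimeout(() => {
  timerFiredAfterMs = Date.now() - started;
  console.log(`timer fired after ${timerFiredAfterMs} ms`);
}, 0);

// Child that keeps its process alive for about 300 ms.
// Nothing else in this script runs until it has exited.
execFileSync(process.execPath, ['-e', 'setTimeout(() => {}, 300)']);

const blockedForMs = Date.now() - started;
console.log(`execFileSync returned after ${blockedForMs} ms`);
console.log(`timer fired yet? ${timerFiredAfterMs !== null}`);
```

The timer callback can only run once the synchronous call has returned and the script has yielded back to the event loop, which is why it reports a delay of several hundred milliseconds instead of a few.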

Further reading:

  • phantom-node bridge
  • DZone
  • NodeJs documentation
  • Download files using NodeJs