Using PhantomJS at scale

About a year ago SmugMug had a dilemma. Our upcoming site-wide redesign and refactor (aka the new SmugMug) moved all of our rendering code into client-side JavaScript using YUI. That created a problem: SEO is critical for our customers, and search engines couldn’t index the new site.

Possible solutions were thrown around: do we duplicate our code in PHP? Use an artificial DOM? What about PhantomJS? Duplicating code would be a monumental effort and a continued burden when writing new features. Initial tests of fake/artificial DOMs proved unreliable. A small prototype Node.js web server that hooked into PhantomJS showed promise: Node.js’ async model is a natural fit for work that waits on I/O, like rendering webpages. We came up with the project name ‘The Phantom Renderer’ soon after.

The prototype

I spent a few days whipping up a prototype proxy server that worked like so (a rough sketch follows the list):

  • A Node.js web server accepts a URL in the query string
  • Node.js sends that URL to a newly spawned PhantomJS process that listens on stdin
  • PhantomJS fetches the page; 500ms after the last HTTP request is sent, we grab the rendered content via the page.content property
  • PhantomJS sends the content back to Node.js
  • Node.js sends the content back to the search bot
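
The prototype looked roughly like this. This is an illustrative sketch, not our actual code; the file names, port and error handling are made up. First, the Node.js side:

    // server.js: accept ?url=..., hand it to PhantomJS, return the rendered HTML
    var http = require('http');
    var url = require('url');
    var spawn = require('child_process').spawn;

    http.createServer(function (req, res) {
      var target = url.parse(req.url, true).query.url;
      if (!target) {
        res.writeHead(400);
        return res.end('Missing ?url= parameter');
      }

      var phantom = spawn('phantomjs', ['render.js']); // new process per request
      var html = '';

      phantom.stdout.on('data', function (chunk) { html += chunk; });
      phantom.on('close', function () {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end(html); // send the rendered page back to the search bot
      });

      phantom.stdin.write(target + '\n'); // hand the URL over stdin
    }).listen(8080);

And the PhantomJS side, with the naive ‘500ms after the last request’ heuristic:

    // render.js: read a URL from stdin, print the rendered DOM to stdout
    var system = require('system');
    var page = require('webpage').create();
    var target = system.stdin.readLine();

    var timer = null;
    page.onResourceRequested = function () {
      // Reset a 500ms timer every time a new request starts; when it
      // finally fires, assume the page has finished rendering.
      if (timer) { clearTimeout(timer); }
      timer = setTimeout(function () {
        console.log(page.content); // the rendered DOM as HTML
        phantom.exit();
      }, 500);
    };

    page.open(target);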

We thought we had a fairly simple and working solution.

The reality

While our prototype (mostly) worked, we knew we had a lot of work to do. Our pages were complex JavaScript applications with many HTTP requests and expectations that they would live in a ‘traditional’ desktop browser. Googlebot would sometimes crawl us at over 500 requests/second. PhantomJS can be CPU and memory intensive (and randomly crashes or freezes). We had to be absolutely sure we were sending back fully rendered pages.

Problem 1: When is a webpage ‘complete’?

In our prototype app we assumed that a webpage was ‘finished’ 500ms after the last HTTP request had begun. As you can probably guess, this is incredibly naive. Our site loads dozens of images, scripts and stylesheets (not to mention lots of analytics code). Some load instantly; some take more than 500ms to return content. What happens if a request fails completely? If the page is redirected (301, 302 or even via a JS/meta tag)? 404s? We had to handle all of those cases appropriately and gracefully.

At first, we had many pages that looked like this after ‘rendering’:

[Screenshot: a completely blank page]

Obviously, this wasn’t going to work.

Through a lot of manual testing and QA we eventually came to a solution where we track each and every HTTP request PhantomJS makes and watch every step of the transaction (start, progress, end, failure). Only once every single request has completed (or failed) do we start ‘waiting’: we give the page 500ms to either start making more requests or finish adding content to the DOM. After that timeout we assume the page is done.
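
In code, the fix looks roughly like this. Again, a simplified sketch rather than our production code, and it assumes PhantomJS 1.9+ for the onResourceError and onResourceTimeout callbacks:

    var page = require('webpage').create();
    var pending = {};   // in-flight requests, keyed by resource id
    var doneTimer = null;

    function maybeDone() {
      // Start the 500ms ‘quiet period’ only when nothing is in flight.
      if (Object.keys(pending).length > 0) { return; }
      if (doneTimer) { clearTimeout(doneTimer); }
      doneTimer = setTimeout(function () {
        console.log(page.content); // nothing started in 500ms: call it done
        phantom.exit();
      }, 500);
    }

    page.onResourceRequested = function (req) {
      pending[req.id] = true; // track by id, not URL (see Problem 2 below)
      if (doneTimer) { clearTimeout(doneTimer); doneTimer = null; }
    };

    page.onResourceReceived = function (res) {
      // Fires at 'start' and 'end' for each resource; only 'end' counts.
      if (res.stage === 'end') { delete pending[res.id]; maybeDone(); }
    };

    page.onResourceError = function (err) {
      delete pending[err.id]; // a failed request must not block completion
      maybeDone();
    };

    page.onResourceTimeout = function (req) {
      delete pending[req.id];
      maybeDone();
    };

    page.open('http://example.com/gallery');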

Once we did that, we had a 100% success rate for rendering pages and saw pages that looked like this:

[Screenshot: a fully rendered gallery page]

Much better! But we weren’t out of the woods yet…

Problem 2: PhantomJS and Node.js Bugs

Getting PhantomJS to render pages correctly during testing was a lot of work, but dealing with PhantomJS bugs made us tear our hair out on occasion. When you’re handling more than 500 requests/second you uncover sporadic, random bugs that most people don’t. We also use a large percentage of the PhantomJS API, which makes us more likely to hit bugs or undocumented behavior. And we were new to PhantomJS, so there was plenty of user error 🙂

Some of the fun bugs and problems we dealt with:

  • If PhantomJS got into a redirect loop, it would hog the CPU and rapidly fill memory until it crashed itself or the server it ran on
  • Random ECONNRESET errors from child processes upon termination
  • A small percentage of PhantomJS processes simply never returning
  • PhantomJS’ onResourceRequested and onResourceReceived returning different URLs for the same resource due to URL encoding, which causes problems if you track requests by URL
  • Expecting PhantomJS processes to terminate cleanly. Don’t: tell the process to exit, then kill it. Double tap! (See the sketch below.)
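
That last one deserves a sketch. On the Node.js side, the ‘double tap’ looks something like this (a hypothetical helper, not our actual code; it assumes the PhantomJS script exits when its stdin closes):

    // Ask PhantomJS to exit, then force-kill it if it lingers.
    function stopPhantom(child) {
      child.stdin.end(); // closing stdin tells the render script to exit
      var killer = setTimeout(function () {
        child.kill('SIGKILL'); // the double tap: no clean exit, no mercy
      }, 5000);
      child.on('exit', function () {
        clearTimeout(killer); // it exited cleanly; cancel the force-kill
      });
    }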

Problem 3: Scaling PhantomJS and Node.js

Since this was a brand new project and we knew rendering web pages was CPU intensive, we spent a lot of time running benchmarks (and learning how to benchmark).

Our testing infrastructure consisted of a test Phantom Renderer box and a separate server running http_load to send varying amounts of traffic. We created a list of 600 public gallery URLs from our most popular customer sites and repeatedly slammed our test server with varying load to determine the best combination of processes, CPU and RAM.

It’s also important to document both the raw number of requests/second and the response time. A server isn’t very useful if it can handle hundreds or thousands of requests/second but takes far too long to complete them.

When performance testing we learned a few things:

  • Don’t test against your normal QA/test environment. This will make your QA and dev teams unhappy.
  • Do make sure that any dependent services can also handle the additional load/traffic!
  • Do use workloads and data as close to production as possible.
  • Do repeat your tests multiple times to allow services to ‘warm up’.
  • Do test multiple configurations (number of processes, max connections, etc.) on the same hardware.
  • Do write down all your results and extra data.
  • Do test for long periods of time (hours at least). You’ll probably uncover issues that won’t occur during a short performance test.

We also had a few problems scaling PhantomJS once it was in production and running for long periods of time:

  • Setting PhantomJS’ cache size too large, causing all 64 PhantomJS processes to slam the disk with reads and writes when the cache filled up and items needed to be removed.
  • Running too many PhantomJS instances, filling up RAM over a period of a few hours and causing processes to be killed.
  • Node.js’ Cluster module on Ubuntu not load balancing equally between processes, causing server CPU to be underutilized (the fix is to put HAProxy in front of Node.js; see the config sketch after this list).
  • Setting the connection limit on our HAProxy servers too high, overloading our servers.
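
For the curious, the HAProxy layer in front of the Node.js processes looks roughly like this (an illustrative config, not our actual one; ports and limits are made up):

    frontend renderer
        bind *:80
        mode http
        default_backend node_renderers

    backend node_renderers
        mode http
        balance roundrobin                      # spread load evenly, unlike Cluster
        server node1 127.0.0.1:8001 maxconn 64  # cap connections per process so
        server node2 127.0.0.1:8002 maxconn 64  # we don't overload the boxes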

We also spent some time optimizing PhantomJS to load pages quickly: turning off image loading, allowing it a small disk cache and keeping the PhantomJS processes alive instead of respawning them for every request. In addition, we spawn a separate Node.js process for each processor core, allowing for massive parallelization.
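
Most of those optimizations are PhantomJS command-line switches. Spawning a long-lived worker looks something like this (flag values are illustrative, not our production settings):

    var spawn = require('child_process').spawn;

    var phantom = spawn('phantomjs', [
      '--load-images=false',         // skip image downloads entirely
      '--disk-cache=true',           // small disk cache for repeat assets
      '--max-disk-cache-size=10240', // cap the cache (in KB) so many processes
                                     // don't thrash the disk (see above)
      'render.js'
    ]);
    // The worker stays alive and renders many pages over stdin/stdout
    // instead of being respawned for every request.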

The importance of logs

[Photo: logs]

During testing and tuning of Phantom Renderer, we developed one strong habit: log everything. When we first started the project, we had no logging whatsoever. Debugging was easy at first when the codebase was small, but once it grew in size and complexity, tracking down the cause of bugs and errors (or even figuring out what PhantomJS and Node.js were doing) became much harder.

About midway through the project we started using Winston, a great logging utility for Node.js. With Winston in place we added logging to every single step of the render process in PhantomJS and the HTTP process in Node.js. We also used Winston’s log levels to allow different levels of logging for debugging and production. Combining that with Splunk gave us deep insight into how specific requests were handled and how often certain errors occurred in production. If you’re starting a new project, logging should be a required piece of it.
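
A minimal version of that setup looks like this (illustrative; this uses the Winston API from around that time, and the log messages and fields are made up):

    var winston = require('winston');

    var logger = new (winston.Logger)({
      transports: [
        // 'debug' in development, 'info' and above in production
        new (winston.transports.Console)({ level: 'info', timestamp: true })
      ]
    });

    logger.info('render start', { url: 'http://example.com/gallery', requestId: 42 });
    logger.error('phantom exited unexpectedly', { code: 1, requestId: 42 });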

The future of The Phantom Renderer

We’re hoping to open source The Phantom Renderer sometime in the near future. Hopefully it will be useful for web apps that have a mix of different frontend and backend technologies. Let us know if it’s something your team or company is interested in using!

We’ll be publishing more in-depth posts about our experience with PhantomJS and Node.js. Stay tuned!


Logs photo by Aapo Haapanen from Tampere, Finland (Logs) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

26 thoughts on “Using PhantomJS at scale”

  1. SmugMug December 17, 2013 / 2:23 pm

    Reblogged this on The SmugMug Blog and commented:
    Did you know that SmugMug has a separate Sorcery blog? From time to time our Engineers will take you behind-the-scenes of something we’ve just built, or share some of the valuable insights we’ve had while investigating a particularly tricky fix. If you’re interested in the technical nuts and bolts of what powers our site or if you just want to hear what’s making our hearts beat a little faster, check it out! Their latest post deep-dives into The Phantom Renderer, or “how to make sure SEO works in the new SmugMug.”

  2. aanton December 18, 2013 / 10:54 pm

    I’m interested in seeing code snippets 🙂

  3. Erez Rabih December 19, 2013 / 2:26 am

    Hi, very nice writeup.
    We at FTBPro.com also did a redesign that made our site a single-page application.
    We use PhantomJS to render pages using a gem we’ve written called phantom-manager – you can check it out here: https://github.com/FTBpro/phantom-manager
    The concept is using nginx as a load balancer to multiple PhantomJS processes while watching each of them closely and restarting them if they don’t respond well.

    I’m happy to see other companies using Phantom as their rendering mechanism; this way, more mature open-source tools will emerge to deal with this widespread problem.

    • Ryan Doherty December 19, 2013 / 11:00 am

      Thanks! phantom-manager looks nice, I’ll definitely check it out.

  4. tmjustin December 19, 2013 / 9:12 am

    This is pretty cool stuff. I’d definitely be interested in more technical detail. One issue in particular that I’ve run into is phantom.js having a really hard time with pages that use require.js, and it doesn’t seem to be addressed very well anywhere on the web. page.open never seems to return a ‘success’ status, and the only workaround I’ve found is using a setTimeout function that waits 20 seconds before rendering the page (totally ignoring the status != ‘success’ returned by page.open). Something more elegant and reliable would definitely be nice.

    • Ryan Doherty December 19, 2013 / 11:04 am

      Thanks! We are going to post more technical details in future blog posts over the next month or two.

      We had similar experiences with the page.open callback; it doesn’t give you the ‘true’ status of the page. We don’t listen to it at all and instead watch the HTTP requests, checking headers and status codes. We use the onInitialized callback to inject JS into the page that checks the document readyState: if it’s already complete, we call back to PhantomJS to say the page loaded; if not, we set a listener on the ‘load’ event and fire the callback when the page loads.

  5. Yannick December 19, 2013 / 2:43 pm

    This is really an excellent article summarizing some of the pitfalls we also encountered while developing our service (http://www.seo4ajax.com). Working with a variety of sites also gives us other surprises, like problems with URL encoding.
    Anyway, despite these problems, PhantomJS is really a great piece of software!

  6. Juan Borras December 21, 2013 / 5:50 am

    Awesome behind-the-scenes. Could somebody down there take some pics (b&w) of this amazing team?

  7. Ross January 2, 2014 / 7:50 am

    Would be great to open source it. I’m a solo developer who has pulled out more than a little hair trying to make simple HTML snapshots with Phantom and Node, and I still have massive performance problems (I wound up running it with ONE PhantomJS instance, since anything else causes timeouts). So looking forward to your code!

  8. ipeychev January 24, 2014 / 4:38 am

    Great article! Did you guys already open source the Phantom Renderer?

    • Ryan Doherty January 29, 2014 / 10:55 am

      Not yet, working on it!

  9. Purnendu Das January 28, 2014 / 10:33 pm

    Did you use any Node modules for PhantomJS in it? Like spawning child processes?

    • Ryan Doherty January 29, 2014 / 10:55 am

      We used a few npm modules: generic-pool, winston and node-syslog. We also used the built-in child process module.

  10. Alex February 22, 2014 / 10:05 pm

    Hi Ryan, this sounds great! We just spent a couple of months doing the same, but we wrote it in Python and PyWebKit. We’ve now started rewriting it in Node.js and PhantomJS to see if we can increase performance. It would be really good if we could have a chat, as it sounds like you’ve made great headway already. Can you give me a buzz on email?

    Thanks
    Alex

  11. Brian Stanback April 3, 2014 / 11:13 am

    Great write-up! Our engineering team went through a similar experience when we wanted to make our Ember.js apps crawlable.

    This ended up becoming my pet project, which is open sourced here https://github.com/zipfworks/ember-prerender, and I can relate to many of the issues you’ve described. Our project is fairly coupled to Ember.js but it should be fairly easy to fork and modify for other frameworks.

  12. arnaudleroy April 21, 2014 / 2:15 am

    Thanks for all this information. On the logging side, what product from Splunk did you use? (I looked at Winston and it’s obviously interesting, but Splunk’s web site is not very clear…)

    • Ryan Doherty April 21, 2014 / 12:00 pm

      We use Winston to get logs into syslog, which are then automatically sent to Splunk.

  13. WebchickNY June 4, 2014 / 8:48 am

    Hi Ryan, just checking in to see if you were still planning to open source this… I’d be very interested in seeing how you were able to get this type of performance and stability with PhantomJS…

  14. David June 26, 2014 / 10:53 am

    Was this ever open sourced?

  15. David July 9, 2014 / 1:53 am

    Hi Ryan, do you still plan to open source it?

  16. David July 9, 2014 / 1:54 am

    I am very interested in it for a personal project

  17. garriss July 9, 2014 / 4:47 pm

    Reblogged this on Expanded(UI) and commented:
    Wow, a lot of activity in this space! After rejecting a massive Require.js-to-Browserify rewrite, we’re working on this too.

  18. Chris September 17, 2014 / 11:47 pm

    I don’t get it; how did you add in the delay and everything?
