Dynamic Publishing: Bake or Fry?
I was reading Tim's MT 3.1 Dynamic Publishing Blues and it reminded me of a piece I read almost two years ago, Half baked and little fried, about dynamic vs. static generation. Timothy's post is about the controversy over the dynamic publishing recently introduced in MovableType, which uses PHP as its engine. Ben Trott has explained the reasoning behind the decision, but I'm not sure I agree with him.
In his analysis Ben describes several options and dismisses most of the Perl-related ones as inefficient. Leaving the speed of the interpreters aside, I guess he is talking about the startup penalty of the Perl interpreter, which is paid on every request. This cost is definitely there, but the solution reminds me of the advice to buy more hardware to solve a performance problem without first trying to optimize the algorithm being used. Let's look at it.
The stated problem of reducing latency and server load can be solved in several ways:
1. Expires and Last-Modified headers. The script can return these headers (along with ETag) to make the page cacheable. The tricky part is setting a proper Expires header. The lifetime should be long enough to minimize calls to the server and short enough to allow those calls when the content changes. This can be achieved by giving different values to different pages, or even by doing something similar to what Googlebot does when it visits frequently changing pages more often: extending the lifetime by 10% each time a page expires unchanged and cutting it by 50% each time the page is modified may be a good start. Expires/Last-Modified also works well with static content (images, client-side scripts, and stylesheets). There the Expires header can be set to a fairly large value; if a file is updated, it can be served under a new URL.
2. Handling of If-Modified-Since and If-None-Match. The script should return 304 if the content hasn't been modified. It doesn't even need to have a copy of the page; all it needs to know is that the page hasn't changed. This may be as simple as one LastModified time per blog/site that invalidates all caches (good enough for rarely updated sites), or as complex as dependency tracking that knows exactly what information was used to generate a page.
3. Local cache of generated pages. The dependency check still needs to be done, but the page may already be generated and can be served from a local cache (likely the file system). At this point most people would point out that all this can be done by saving generated pages as static files and having them served by a web server with a little bit of mod_rewrite-like magic. While that's true, there are still several things that need to be addressed:
- Expires values need to be configured, and they are likely to be static
- Any custom headers need to be configured
- No personalization is possible (new since the last visit and other similar things)
- Authentication requests may need to be handled separately
- No parametrized requests: pagination, searches, and the like
Pages can be cached (and even compressed) along with their headers and served when necessary. While this may be a viable option in many cases, there is still the question of how this cache should be validated and invalidated: dependencies can be checked on every request, or they can be checked when pages are added, updated, or deleted.
4. Template fragment caching. While the script may not cache the entire page, it may still be feasible to cache some page fragments, especially the most time-consuming or most frequently used ones, such as recently updated items or a list of subcategories that is likely to be used across many pages. This requires tracking which fragment uses what information so that fragments can, again, be properly invalidated, but this may not be as complex as it seems.
5. File/memory cache. While template fragments may not be cached, the information that is necessary for page generation can be cached in memory (applicable to mod_perl, daemon, and similar server solutions) or in files (this works well for filesystem-based solutions like Blooki, Blosxom, and other file I/O hungry solutions). It is not necessary to cache all the information; in most cases the modification date/time, title, and some meta information are enough.
6. Access optimization. This probably doesn't apply to MT, but it definitely applies to Blooki (which uses the filesystem to store its information). Even when information is not available in a cache, it's still possible to optimize the process of getting it. Blooki is super-lazy about getting the stuff it needs. First, it's driven by templates; if something is not requested by a template, it probably won't be processed. Second, it only reads directories at first, without even stat'ing the files in them. Then it only stats files if you ask for their modification times. And then it only reads their content if you ask for the title, meta, or other information.
7. Direct access. If nothing else helps, then the information has to be read and the page has to be regenerated from scratch.
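Items 1 and 2 can be sketched roughly as follows. This is a minimal Python illustration and not how MT or Blooki does it: the function names, the TTL bounds, and the idea of deriving the ETag from a single last-modified timestamp are all assumptions made for the example. The point it tries to show is that a 304 can be answered without generating (or even having) the page.

```python
from email.utils import formatdate, parsedate_to_datetime

MIN_TTL = 60           # seconds; hypothetical lower bound
MAX_TTL = 7 * 86400    # seconds; hypothetical upper bound


def next_ttl(ttl, changed):
    """Grow the lifetime by 10% if the page expired unchanged,
    cut it by 50% if the page was modified (clamped to the bounds)."""
    ttl = ttl * 0.5 if changed else ttl * 1.1
    return max(MIN_TTL, min(MAX_TTL, ttl))


def respond(request_headers, render, last_modified, ttl, now):
    """Return (status, headers, body) for a GET request.

    last_modified and now are Unix timestamps in whole seconds;
    render is a callable that builds the page and is only invoked
    when a full response is actually needed.
    """
    # Assumption: one timestamp identifies the current version of the page.
    etag = '"%x"' % last_modified
    headers = {
        "ETag": etag,
        "Last-Modified": formatdate(last_modified, usegmt=True),
        "Expires": formatdate(now + ttl, usegmt=True),
    }

    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    not_modified = False
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        tags = [tag.strip() for tag in if_none_match.split(",")]
        not_modified = etag in tags
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
            not_modified = last_modified <= since
        except (TypeError, ValueError):
            not_modified = False  # unparsable date: send the full page

    if not_modified:
        return 304, headers, b""
    return 200, headers, render()
```

A per-page `ttl` would be stored somewhere and updated with `next_ttl` whenever the page is re-checked; where that state lives is left open here.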
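The lazy access described in item 6 might look something like the sketch below. It is written in Python purely for illustration and is not Blooki's code: the class name, the counters, and the assumption that an entry's title is the first line of its file are all hypothetical.

```python
import os


class LazyEntry:
    """An entry that touches the filesystem only when asked to."""

    def __init__(self, path):
        self.path = path
        self._stat = None
        self._text = None
        self.stat_calls = 0   # counters exist only to make the laziness visible
        self.read_calls = 0

    @property
    def mtime(self):
        # stat the file only when the modification time is requested
        if self._stat is None:
            self._stat = os.stat(self.path)
            self.stat_calls += 1
        return self._stat.st_mtime

    @property
    def title(self):
        # read the content only when the title is requested
        if self._text is None:
            with open(self.path, encoding="utf-8") as handle:
                self._text = handle.read()
            self.read_calls += 1
        lines = self._text.splitlines()
        # Assumption: the title is the first line of the file.
        return lines[0] if lines else ""


def list_entries(directory):
    """Read the directory only: no stat, no open."""
    return [LazyEntry(os.path.join(directory, name))
            for name in sorted(os.listdir(directory))]
```

A template that only prints file names costs one directory read; one that sorts by date adds a stat per entry; only a template that shows titles ends up opening files.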
Now, back to the original question: was it worth it? Unfortunately, it's not clear from the description whether Perl is used at all when PHP-based rendering is enabled, but as far as I understand it is (please correct me if I'm wrong). Switching from Perl to PHP only addresses items 2 and 3; doing everything else still requires the Perl interpreter (and hence the startup penalty). Now, both 2 and 3 can be achieved quite effectively by using a local proxy/cache, which some users may already have; for those who don't, setting one up is a one-time job that is much easier than integrating a PHP engine. In my opinion the answer is clear.