1. Why should we use caching ?

Basically, it takes time and processing power to dynamically generate each requested webpage, in PostNuke or any other dynamic website. This affects not only the webserver itself (load), but also the visitors of your website (response time). So it makes sense to try and reduce the effort (in time and resources) needed to present a particular page, especially if it's requested frequently like your homepage.

One of the ways to do that is by caching : you generate the webpage dynamically only once every so often, and you save the resulting HTML page in a "cache" somewhere (e.g. database, file, memory). Next time someone requests that page, you then simply retrieve the prepared HTML page and send it back, instead of regenerating the page dynamically again.
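As a minimal sketch of that idea (in Python, purely for illustration; PostNuke itself is PHP, and the `PageCache` class and its `get_page` method are hypothetical names, not an existing API), an in-memory page cache could look like this:

```python
import time


class PageCache:
    """In-memory page cache keyed by URL (hypothetical helper, not a PostNuke API)."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock          # injectable so the cache can be tested without sleeping
        self.entries = {}           # url -> (stored_at, html)

    def get_page(self, url, generate):
        """Return cached HTML for url, regenerating it when missing or too old."""
        now = self.clock()
        entry = self.entries.get(url)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]         # cache hit : no regeneration needed
        html = generate(url)        # cache miss : the expensive dynamic step
        self.entries[url] = (now, html)
        return html
```

Here `generate` stands for whatever normally builds the page dynamically. The same structure applies when the entries live in a file or a database table instead of a dictionary.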

Obviously, you only gain something when it takes longer to regenerate the page than to retrieve it from cache - otherwise you might as well not bother. Based on performance tests, even simple pages like backend.php can profit from caching, let alone complex ones like the homepage...

2. What elements should we cache ?

There are different levels of caching that could be done in an environment like PostNuke.

The first (and most effective) way is to cache the whole page, based on the requested URL for instance. This has the advantage that lookups in the cache are pretty much limited to a single step, and that time and load are reduced to the strictest minimum. Of course, this comes at the cost of "space" - you'll have to store many whole pages, even though many of the building blocks (header/footer, menu, blocks, ...) will remain the same from page to page.

So a second way is to cache the HTML output of the individual building blocks, and to reconstruct the requested page from those cached building blocks. You still gain by the fact that it's the generation of those building blocks that takes the most time, not the action of putting them together, but time and load *will* be higher than with the first option.

Since PostNuke users may potentially have different themes for the same page, a third way of caching would be to save the block/module results *before* they're processed for final output by the selected theme. This further decreases the amount of "space" needed for the cache, but again increases time and load compared to the previous option.
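To make the "space" trade-off between these three levels a bit more concrete, here is a small sketch (in Python, for illustration only) that lists the cache keys each level would need. The key layout and the function name `keys_needed` are assumptions of mine, as is the simplification that a whole-page entry depends only on the URL and the theme :

```python
from itertools import product


def keys_needed(level, urls, blocks, themes):
    """Return the set of cache keys one caching level needs (illustrative key layout)."""
    if level == "page":            # whole page : one entry per URL and theme
        return {("page", u, t) for u, t in product(urls, themes)}
    if level == "block_output":    # themed HTML of a block, shared across pages
        return {("block_output", b, t) for b, t in product(blocks, themes)}
    if level == "block_data":      # block result before theming, shared across themes
        return {("block_data", b) for b in blocks}
    raise ValueError("unknown caching level : " + level)


urls = ["/page%d" % i for i in range(100)]
blocks = ["block%d" % i for i in range(10)]
themes = ["theme_a", "theme_b", "theme_c"]
for level in ("page", "block_output", "block_data"):
    print(level, len(keys_needed(level, urls, blocks, themes)))
```

With these made-up numbers (100 URLs, 10 blocks, 3 themes), the three levels need 300, 30 and 10 entries respectively. What the sketch does not show is the other side of the trade-off : the fewer entries you store, the more work is left to do on every request.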

To give you an idea of the impact of the different caching levels, here are the measured response times for the homepage of a fairly large PostNuke site. The test includes a mix of anonymous visitors (potentially cached) and registered users (uncached), and the occasional regeneration of the cached content :

    - no caching : 676 ms (median = 655 ms)
    - variable caching per individual block/item : 395 ms (median = 310 ms)
    - output caching per individual block/item : 351 ms (median = 283 ms)
    - output caching per group of blocks/module : 312 ms (median = 236 ms)
    - output caching of all page elements : 301 ms (median = 225 ms)
    - output caching of the whole page : 279 ms (median = 200 ms)

In this test, the server is able to serve 1.48 requests/second in the "no caching" situation, and 3.58 requests/second when doing "whole page caching".

[Note 1 : in this test configuration, there was a "fixed cost" of 179 ms to load the PostNuke core (for PN .713), so by streamlining the loading of the PostNuke core when a page is served from cache, we should get well below the 100 ms mark, even for pages cached in the database]

[Note 2 : the measured response times should be taken as a *relative* performance indicator, not as absolute values that will apply for your server configuration...]
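For readers who prefer ratios, the figures above can be put side by side with a few lines of arithmetic (a Python sketch; the variable names are mine, the numbers are copied from the list and from Note 1) :

```python
# Figures copied from the measurements above.
no_cache_ms, whole_page_ms = 676, 279        # first response time listed per level
no_cache_rps, whole_page_rps = 1.48, 3.58    # requests/second
core_load_ms = 179                           # fixed cost mentioned in Note 1

throughput_gain = whole_page_rps / no_cache_rps
time_saved = 1 - whole_page_ms / no_cache_ms
left_after_core = whole_page_ms - core_load_ms

print("throughput gain : %.2fx" % throughput_gain)
print("response time saved : %.0f%%" % (time_saved * 100))
print("cached page minus core load : %d ms" % left_after_core)
```

In other words, whole page caching serves roughly 2.4 times as many requests and cuts the response time by about 59%, and of the 279 ms that remain, only 100 ms are spent outside the loading of the PostNuke core.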

For now, we'll continue on the assumption that we aim for page caching, and return to the issue of block caching later on.

3. What pages can be cached ?

In short, any URL that :

    - is frequently requested, and
    - takes a relatively long time/load to generate, and
    - is likely to remain identical
        - for a certain period of time, and
        - for a certain group of visitors (with the same permissions)

is a good candidate for caching.

Statistics based on webserver logfiles show that even for large sites, as much as 70% of all page requests concern only the top 25 pages, and as much as 60% concern only the top 10 pages, with the rest of the page requests spread out all over the site.

Obviously, this will vary from site to site and on the profile of your visitors, but most likely there will still be a very small set of pages that make up the majority of your 'hits'. Those are the prime candidates for caching...
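If you want to check this for your own site, the share of requests going to the top N pages is easy to compute once you have extracted the requested URLs from your logfiles. A minimal sketch in Python (the function name `top_n_share` is hypothetical, and parsing the logfile itself is left out) :

```python
from collections import Counter


def top_n_share(requested_urls, n):
    """Fraction of all requests that go to the n most requested URLs."""
    counts = Counter(requested_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_hits = sum(hits for _, hits in counts.most_common(n))
    return top_hits / total
```

For example, `top_n_share(urls, 10)` gives the fraction of hits covered by your top 10 pages, which you can compare with the 60% mentioned above.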

4. How often should we refresh cached pages ?

In order to determine how well a caching system performs, people often use the notion of "Cache Hit Rate" : the number of times a page was served from cache, compared to the total number of hits for that page. For example, a cache hit rate of 25% means that 1 page out of 4 can be served from cache, and it doesn't need to be regenerated from scratch again. A cache hit rate of 75% means that 3 pages out of 4 come from the cache...

The figure below shows a simulation of the cache hit rate for the top N pages, with different caching strategies. The legend shows the total number of hits for each of the top N pages.

A very basic scenario for caching could use a timeout (since the last request) to determine if a cache entry can still be considered "valid" or not. The first block in the figure shows the cache hit rate for different timeout (t/o) values, varying from 15 seconds to 5 minutes.

As you can see, even for *very* short timeout values between requests, the most frequently hit pages can benefit from caching in a significant way...
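The timeout strategy is simple enough to replay against a list of requests. The sketch below (Python, for illustration; it is not the script used for the figure, and the function name is hypothetical) counts a request as a cache hit when the same URL was last requested no longer than `timeout` seconds ago :

```python
def hit_rate_idle_timeout(requests, timeout):
    """Cache hit rate for a 'timeout since last request' strategy.

    requests : iterable of (timestamp_in_seconds, url), sorted by timestamp.
    """
    last_request = {}   # url -> timestamp of the most recent request
    hits = total = 0
    for timestamp, url in requests:
        total += 1
        previous = last_request.get(url)
        if previous is not None and timestamp - previous <= timeout:
            hits += 1                   # entry still valid : served from cache
        last_request[url] = timestamp   # every request restarts the timeout
    return hits / total if total else 0.0
```

The returned value is the cache hit rate as defined above : hits served from cache divided by the total number of hits.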

Of course, this simple caching strategy doesn't fully take into account the dynamic aspect of the site : users/editors can modify content at any time, so it would be nice if they didn't have to wait 60 seconds or more before seeing their modifications appear. So a nicer caching strategy would be to reset a cache entry whenever someone undertakes an action (e.g. post a new article, add a comment, ...) that affects its content.

Simulating this based on webserver logfiles goes a bit far, so I took the more radical approach of invalidating the complete cache whenever any POST request was issued. This obviously represents a worst-case scenario, but as you can see in the second block on the figure, even under these very pessimistic circumstances, you can get good caching results for heavily loaded sites.

A better variation on the cache timeout (t/o) used above would be to force a refresh of cache entries at a specific refresh time (r) after they've been updated, instead of (or in addition to) updating them only when they haven't been requested for a certain time. This is shown in the third and fourth blocks in the figure. Again, the idea of invalidating the whole cache whenever a POST request arrives is too simplistic for realistic environments, but it gives you an idea of the worst-case scenario, with still an overall cache hit rate of 50% for the top N pages !
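The refresh-time variant, combined with the "any POST invalidates everything" worst case, can be replayed in the same way. Again this is only a sketch in Python under my own assumptions (hypothetical function name, requests given as timestamp/method/URL tuples, and only the refresh time modelled, without the additional timeout) :

```python
def hit_rate_refresh(requests, refresh, invalidate_on_post=True):
    """Cache hit rate when entries are regenerated `refresh` seconds after creation.

    requests : iterable of (timestamp_in_seconds, method, url), sorted by timestamp.
    """
    generated_at = {}   # url -> timestamp at which the cached copy was generated
    hits = gets = 0
    for timestamp, method, url in requests:
        if method == "POST":
            if invalidate_on_post:
                generated_at.clear()    # worst case : drop the complete cache
            continue                    # POST requests are never served from cache
        gets += 1
        created = generated_at.get(url)
        if created is not None and timestamp - created < refresh:
            hits += 1                       # cached copy is still fresh
        else:
            generated_at[url] = timestamp   # regenerate and store a new copy
    return hits / gets if gets else 0.0
```

Running it twice on the same requests, with and without `invalidate_on_post`, shows how much of the hit rate is lost to the pessimistic invalidation rule.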

5. Caching with PostNuke

It doesn't really make sense to try and cache *all* possible pages that could be generated by PostNuke. So ideally, PostNuke should be able to keep track of what the top N pages are over time, and adapt its caching strategy accordingly. Or instead of doing this automatically, it could also show the PostNuke administrator the list of top N pages, and let him decide which pages to cache or not.

Another interesting option would be for the administrator to decide which modules (or module parts) should be cached. For instance, the homepage, the articles mentioned on the homepage and the index of Downloads, FAQs, Forums, Sections, Web Links, etc. are all interesting candidates for caching, since they are usually requested frequently, and don't necessarily change every minute (or even every other day in some cases). In order to optimize the use of the cache, each of these could also be assigned a different refresh time...

So far, none of this would be very difficult to implement. There are however two things that complicate the issue : notification of changes, and user permissions.

For pages whose content can change relatively frequently (like forum indexes), you can either use a refresh time small enough that users (and especially publishers) aren't really inconvenienced by seeing "old" content, or have some way of informing the caching system that a particular cache entry has been affected and should be invalidated.

For the second option, adding a specific function call like pnCacheInvalidateEntry("Forum_index_123") in a module when content is being updated for Forum 123 would be the "easiest" way of achieving this, but it would mean that any module writer (or user) who wants to make use of caching would need to add those calls in the right places (and with the right parameters).

A nicer approach would be to make use of hooks called on any update of content, with the particular problem (besides the fact that this doesn't seem feasible at the moment ?) of knowing which pages will be affected by which updates.

A third (more pragmatic) approach would be to invalidate cache entries for a particular module whenever one of its admin or update functions is invoked, and let the module writer (or admin) decide which functions should trigger a cache refresh by default (e.g. for Forums, any admin function, and the user function 'post'). I'm not familiar enough yet with the hook mechanisms to determine which approach would be "best".
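To give an idea of what the third approach involves, here is a rough sketch (in Python, for illustration only; the `ModuleCache` class, its methods and the `"*"` wildcard are all hypothetical and do not exist in PostNuke). The dispatcher would call `on_function_call` before running any module function :

```python
class ModuleCache:
    """Hypothetical per-module cache with function-triggered invalidation."""

    def __init__(self):
        self.entries = {}    # module name -> {entry key: html}
        self.triggers = {}   # module name -> set of (function type, function name)

    def set_triggers(self, module, triggers):
        """Declare which functions of a module invalidate its cache entries."""
        self.triggers[module] = set(triggers)

    def store(self, module, key, html):
        self.entries.setdefault(module, {})[key] = html

    def fetch(self, module, key):
        return self.entries.get(module, {}).get(key)

    def on_function_call(self, module, func_type, func_name):
        """Drop all cache entries of a module if the invoked function is a trigger."""
        triggers = self.triggers.get(module, set())
        if (func_type, func_name) in triggers or (func_type, "*") in triggers:
            self.entries.pop(module, None)
            return True
        return False


cache = ModuleCache()
# The Forums example from the text : any admin function, and the user function 'post'.
cache.set_triggers("Forums", [("admin", "*"), ("user", "post")])
```

This invalidates more than strictly necessary (all entries of the module rather than only the affected pages), which is the price for not having to know which pages a given update touches.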

The other issue is with user permissions. Anonymous visitors are easy to handle, because they all have the same rights, they can all see the same things, and often they generate the majority of page requests on a website anyway, so caching output for them certainly makes sense. But if your site is primarily a "members" site, or if you want to introduce caching for registered users too (after all, why should they get the slow-loading pages ?), you have the particular problem of determining if user X and user Y have the same set of permissions for a particular piece of content, so that they could both see the same (potentially cached) page. And perhaps they can get the same "main" content, but they have some personalized menu or block, in which case caching at block level rather than at page level (cf. section 2 above) starts making sense again.
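One possible way to let users X and Y share a cache entry is to make a digest of their permission set part of the cache key. The sketch below (Python, for illustration) assumes that the permissions relevant to a page can be listed as plain strings, which is a simplification of how PostNuke permissions actually work; the function names and the permission strings in the usage note are hypothetical :

```python
import hashlib


def permission_signature(permissions):
    """Stable digest of a permission set : equal sets give equal digests."""
    canonical = "\n".join(sorted(set(permissions)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def page_cache_key(url, permissions):
    """Cache key shared by all visitors with the same permissions for this URL."""
    return url + "#" + permission_signature(permissions)
```

Two users whose permissions are `["Forums::read", "News::read"]` and `["News::read", "Forums::read"]` get the same key and therefore the same cached page, while a user with an extra `"Forums::admin"` gets a separate entry. Anonymous visitors simply become the group with the "anonymous" permission set.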

No definite answers on these issues yet - suggestions are welcome...

6. Other ways to improve performance

Caching can be combined with database and code optimization, output compression, 304 Not Modified headers, etc. - see Performance Tips
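As an illustration of the 304 Not Modified idea : when the browser already holds the current version of a page, the server can answer with an empty 304 response instead of sending the page again. A simplified sketch in Python (the function name is hypothetical, and real If-None-Match handling also has to deal with lists of values and weak validators, which are ignored here) :

```python
import hashlib


def conditional_response(body, if_none_match=None):
    """Return (status, headers, body) for a GET, honouring a single If-None-Match value."""
    etag = '"' + hashlib.sha256(body.encode("utf-8")).hexdigest() + '"'
    headers = {"ETag": etag}
    if if_none_match is not None and if_none_match.strip() == etag:
        return 304, headers, ""     # client copy is current : send no body
    return 200, headers, body       # first visit, or the page has changed
```

This combines well with page caching : the ETag can be computed once and stored along with the cached page, so answering with a 304 costs next to nothing.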

7. More...

PostNuke Performance Analysis : Site Statistics - Performance Tips - Application Profiling - Caching Strategies