This is the first entry in a short series I’ll do on caching in PHP. During this series I’ll explore some of the options that exist when caching PHP code and provide a unique (I think) solution that I feel works well to gain high performance without sacrificing real-time data.
Caching in PHP is usually done on a per-object basis: people cache a query result or a CPU-intensive calculation to avoid repeating the expensive work. This can get you a long way. I have an old site that uses this method and serves 105 requests per second on really old hardware.
An alternative, used for example by the Super Cache WordPress plug-in, is to cache the full page. This essentially means that you generate a page only once. It introduces the problem of stale data, which people usually solve either by checking whether the data is still valid or by using a TTL caching mechanism and accepting stale data.
The method I propose is a spin on full-page caching. I’m a big fan of Nginx and I tend to use it to solve a lot of my problems; this case is no exception. Nginx has a built-in Memcached module, so we can store a page in Memcached and have Nginx serve it, never touching PHP at all. This essentially turns this:
    Concurrency Level:      50
    Time taken for tests:   2.443 seconds
    Complete requests:      5000
    Failed requests:        0
    Write errors:           0
    Total transferred:      11020000 bytes
    HTML transferred:       10210000 bytes
    Requests per second:    2046.32 [#/sec] (mean)
    Time per request:       24.434 [ms] (mean)
    Time per request:       0.489 [ms] (mean, across all concurrent requests)
    Transfer rate:          4404.39 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.1      0       2
    Processing:     6   22  19.7     20     225
    Waiting:        5   20   2.6     20      40
    Total:          6   22  19.7     20     225

    Percentage of the requests served within a certain time (ms)
      50%     20
      66%     21
      75%     22
      80%     22
      90%     24
      95%     26
      98%     29
      99%     39
     100%    225 (longest request)

Into this:

    Concurrency Level:      50
    Time taken for tests:   0.414 seconds
    Complete requests:      5000
    Failed requests:        0
    Write errors:           0
    Total transferred:      11024350 bytes
    HTML transferred:       10227760 bytes
    Requests per second:    12065.00 [#/sec] (mean)
    Time per request:       4.144 [ms] (mean)
    Time per request:       0.083 [ms] (mean, across all concurrent requests)
    Transfer rate:          25978.27 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    1   0.1      1       2
    Processing:     1    3   0.3      3       5
    Waiting:        1    1   0.3      1       4
    Total:          2    4   0.3      4       7

    Percentage of the requests served within a certain time (ms)
      50%      4
      66%      4
      75%      4
      80%      4
      90%      4
      95%      4
      98%      5
      99%      5
     100%      7 (longest request)
What’s important to note here is how these figures will scale. To get these numbers I developed a very simple proof-of-concept news script; all it does is fetch and show data from two MySQL tables, news and comments. A more complicated application might manage only 100 requests per second, and something like WordPress or Magento as few as 20 requests per second! The good thing is that with full-page caching the time required to fetch and display a page depends only on the size of the cached data. Therefore, if your application is written to do full-page caching, it will always be able to enjoy low latency and high concurrency.
Full-page caching does introduce some complications, though. As mentioned earlier, the goal is to make Nginx serve the cached pages, so we cannot perform any logic while a page is being served. This means we need to invalidate cached pages at the moment the data they use is updated.
To be able to invalidate pages it’s important that we understand what data we have to work with and how it relates to not only our pages, but also our code. We will be using a framework so we can create a few rules that will help us understand the whole system.
- The framework uses a three-tiered setup of controllers, libraries and templates.
- Controllers will dictate how to handle a request defined by the URI.
- Libraries will be used to access all data.
This is how most frameworks work. A few of the big ones use an MVC pattern, but such a setup is largely the same. From these rules we can determine the relationship between data, controllers and pages.
- All data will need an identifier. For instance if you have a news script you’ll need an identifier for “news” and “comments”.
- All controllers must specify which data they use by referencing the identifier.
So, to recap: the goal is to invalidate the correct pages, and to do this we need to know which pages use which data. This gives us three important parts.
- The library that handles the editing of data, and therefore the invalidation triggering.
- The controller that handles the requests based on the URI, and therefore relates to the cached pages.
- The actual cached pages.
Finally, we’re unlikely to have only one of each; often multiple controllers will use the same data. To continue our news script example, we have a controller to fetch the news and a controller to generate an RSS feed of the news. Similarly, a controller might generate multiple pages, for instance one page per news post to display its comments. Therefore we also need to consider the relationships between them.
- One-to-many relationship between invalidated data and controllers.
- One-to-many relationship between controllers and pages.
Data & Controllers
Earlier we defined a rule that all controllers must specify which data they use. This is useful as it means we can create a dependency list between data and controllers. When data is invalidated we can do a lookup in the dependency list and see which controllers we need to tell about the invalidated data.
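A minimal sketch of such a dependency list, written in Python for illustration and not taken from the framework itself. The controller classes, the `uses_data` attribute and the helper names are all assumptions made for this example.

```python
from collections import defaultdict


# Hypothetical controllers; each declares the data identifiers it reads.
class NewsController:
    uses_data = ("news", "comments")


class RssController:
    uses_data = ("news",)


class AboutController:
    uses_data = ()  # static page, depends on no data


def build_dependency_list(controllers):
    """Invert the declarations: data identifier -> controllers using it."""
    deps = defaultdict(list)
    for controller in controllers:
        for identifier in controller.uses_data:
            deps[identifier].append(controller)
    return dict(deps)


# Built once, for example as part of a deployment step.
DEPENDENCIES = build_dependency_list(
    [NewsController, RssController, AboutController]
)


def controllers_to_notify(identifier):
    """Look up which controllers must hear about invalidated data."""
    return DEPENDENCIES.get(identifier, [])
```

Here invalidating "news" would notify both the news and the RSS controller, while invalidating "comments" would notify only the news controller.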
This solves the problem elegantly, and with OOP we can define interfaces to force controllers to implement the required methods. If a controller doesn’t implement them, we can set a flag that prevents its output from being cached, and it should work normally.
One possible downside to this is that you can no longer edit files on the fly. If you change the way data is used you will most likely need to regenerate the dependency list, so it becomes critical that you have a deployment process in place for all code changes. Personally I think this is required anyway, so it does not cause me any problems; however, it is something that has to be considered.
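One way to picture the interface idea, again as a Python sketch with invented names (`CacheAware`, `data_identifiers`, `invalidate`, `is_cacheable`); a PHP version would presumably use an `interface` and an `instanceof` check instead.

```python
from abc import ABC, abstractmethod


class CacheAware(ABC):
    """Hypothetical interface a controller implements to opt in to caching."""

    @abstractmethod
    def data_identifiers(self):
        """Return the identifiers of all data this controller uses."""

    @abstractmethod
    def invalidate(self, identifier, record_id):
        """React to a change in one piece of the data it depends on."""


class NewsController(CacheAware):
    def __init__(self):
        self.invalidated = []  # records what was invalidated, for the demo

    def data_identifiers(self):
        return ("news", "comments")

    def invalidate(self, identifier, record_id):
        self.invalidated.append((identifier, record_id))


class LegacyController:
    """Does not implement the interface, so its pages are never cached."""


def is_cacheable(controller):
    """The 'flag': only controllers implementing the interface are cached."""
    return isinstance(controller, CacheAware)
```

A controller that forgets one of the methods cannot be instantiated at all, and a controller that ignores the interface simply runs uncached.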
Controllers & Pages
Websites are by their nature diverse. In this framework all requests are passed to a controller along with the URI. The controller then uses the URI to determine what data to use to generate the output. The problem here is that there is a huge range of options for how the controller might look and behave. It would be really difficult to define something like a dependency list, as a controller might use multiple data sources which update dynamically. This would require the dependency list to be updated every time new data was added, which is not really a feasible solution.
The easy scenario is where the page URI is directly related to the data. For example, in our news script the URI /news/4/ might show the news post with ID 4. If a comment is added to this news post we trigger an invalidation on the comments data identifier. The library that inserts the data knows that the comment belongs to news post 4, so it can pass this ID along when triggering the invalidation. This allows the controller to determine that the page /news/4/ needs to be invalidated.
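The flow for this easy scenario might look like the following Python sketch. Everything here is illustrative: the dictionary stands in for Memcached, and `CommentLibrary`, `pages_for` and `trigger_invalidation` are made-up names, not part of any real framework.

```python
# Pretend two news pages are already cached (dict stands in for Memcached).
page_cache = {
    "/news/4/": "<html>old post 4</html>",
    "/news/5/": "<html>post 5</html>",
}


class NewsController:
    def pages_for(self, identifier, record_id):
        """Map a changed record to the URIs this controller generated."""
        if identifier in ("news", "comments"):
            return [f"/news/{record_id}/"]
        return []


# Dependency list: data identifier -> controllers using that data.
_news = NewsController()
DEPENDENCIES = {"news": [_news], "comments": [_news]}


def trigger_invalidation(identifier, record_id):
    """Ask every dependent controller which pages to drop, then drop them."""
    removed = []
    for controller in DEPENDENCIES.get(identifier, []):
        for uri in controller.pages_for(identifier, record_id):
            if page_cache.pop(uri, None) is not None:
                removed.append(uri)
    return removed


class CommentLibrary:
    def __init__(self):
        self.comments = []  # stands in for the comments table

    def add_comment(self, news_id, text):
        self.comments.append((news_id, text))
        # The library knows which news post was touched and passes it along.
        return trigger_invalidation("comments", news_id)
```

Adding a comment to news post 4 removes only /news/4/ from the cache; the page for post 5 stays cached.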
The bigger problem is when data is used as part of a set defined by data not related to the updated data. A simple example here would be a search function. You have the controller search and the keyword “PHP” being searched for – the URI for this would most likely be /search/PHP/. When a news post is updated we pass along the ID to the controller but we have no way to determine which URI actually uses said news post. Keeping track of each search term is not feasible. There are a few options here but none that are really perfect.
- Don’t cache at all. Data will always be current, but generating it might be CPU intensive.
- Increase caching granularity. Pass each request to PHP, but cache the IDs of the matching news posts and fetch their current data.
- Cache the full page using a time-to-live value. This means we have stale data for a bit but we keep high performance.
Ultimately it depends on your situation and what will fit best. I’d imagine I’d most often choose TTL caching, or increased caching granularity in case I need current data.
This covers the overall system. Next time I’ll talk about how I’ve chosen to implement this.
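To make the TTL option concrete, here is a small Python sketch of a page cache whose entries expire after a fixed time. The class and its parameters are assumptions for illustration; with Memcached the same effect would normally come from setting an expiration time when storing the item.

```python
import time


class TtlPageCache:
    """Tiny in-memory page cache where entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # uri -> (expires_at, html)

    def set(self, uri, html):
        self._entries[uri] = (self.clock() + self.ttl, html)

    def get(self, uri):
        entry = self._entries.get(uri)
        if entry is None:
            return None
        expires_at, html = entry
        if self.clock() >= expires_at:
            # Stale: drop it so the next request rebuilds the page.
            del self._entries[uri]
            return None
        return html
```

A search page such as /search/PHP/ stored with a 60 second TTL would be served from the cache for up to a minute, after which the next request regenerates it.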