When I read the news today, I just had to share this with the Blog – This is such a good example of when things don’t go as planned.
Today there’s been big news flashes on several Finnish news sites about the local train monopoly and its new web-based ticket sales system crashing under heavy load. They had introduced a new online ticket sales system and were literally overrun. The official comment from the company was (translated): “we were surprised to see such traffic volumes”.
A very welcome small detail in the official apology about the crashing system:
“The new ticket system got a 20 times higher than usual rate of sales, resulting in some 800 clicks per second”
They also said it was fixed 2 hours later by adding more hardware, but at 10pm (~12 hours later) I still waited 1m 16s (according to Firebug) for the page to load, which I think is WAY too slow.
Assuming a click is a page load, that would mean 800 page loads/sec × ~68 requests per page (the page plus its resources) ≈ 54,400 HITS/SEC !! I think not.
What they probably mean is 800 HITS/sec, which would make roughly 800/68 ≈ 11 page loads/second. But wait a minute, only 11??
As I have no idea how the system is built, other than that it’s running on nginx web servers, I can’t comment on how fast or slow it should be, but from my experience 11 page loads per second seems a little low. Also, if 11 page loads/sec was peak traffic at 20 times the normal, then normal traffic is only 11/20 = 0.55 page loads/sec (about 37 hits/sec).
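A quick Python sketch of this back-of-the-envelope arithmetic (the ~68 requests per page is my own Firebug observation, so treat it as an assumption):

```python
# Sanity-checking the reported "800 clicks per second", assuming a
# page load generates roughly 68 HTTP requests in total (the page
# plus its resources, as observed in Firebug).

REQUESTS_PER_PAGE = 68  # assumed, from the Firebug observation

def hits_per_sec(page_loads):
    """Total HTTP requests/sec if 'clicks' means full page loads."""
    return page_loads * REQUESTS_PER_PAGE

def page_loads_from_hits(hits):
    """Page loads/sec if 'clicks' actually means raw hits."""
    return hits / REQUESTS_PER_PAGE

print(hits_per_sec(800))                    # 54400 -- implausibly high
print(round(page_loads_from_hits(800), 1))  # 11.8 -- the likelier reading
```

Either way you read “clicks”, only one of the two interpretations gives a plausible number.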
EDIT: Got more info on the sales figures. They have sold 10K tickets/day at peak, where normally they sell around 5,000/day.
Calculating tickets per second: 10K/day ≈ 416/h ≈ 7/min ≈ 0.12/sec
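The same conversion in a couple of lines of Python, using the reported peak of 10,000 tickets/day:

```python
# Converting the reported peak of 10,000 tickets/day
# into per-hour, per-minute and per-second sales rates.
TICKETS_PER_DAY = 10_000

per_hour = TICKETS_PER_DAY / 24    # ~417 tickets/hour
per_minute = per_hour / 60         # ~6.9 tickets/minute
per_second = per_minute / 60       # ~0.12 tickets/second

print(f"{per_hour:.0f}/h, {per_minute:.1f}/min, {per_second:.2f}/sec")
```

So at the absolute peak, a ticket was actually sold only about once every 9 seconds.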
My conclusions about all of this
So my conclusion is that the system was probably built more or less according to the specs it was given, and those specs most probably lacked any meaningful performance requirements. The missing performance requirements are most probably down to inexperience on the part of the people who wrote the specs, specifically about what performance is and what it really means. I’ve seen this in almost all the new projects I’ve been involved in.
I also suspect that proper performance testing of the system has not been done.
In another news article I read that the system had been in development for 2 years. During that time there should have been plenty of time to performance-, stress- and load-test the system, fix the problems, and retest until it reached acceptable performance.
Estimating the Expected Load
The following calculations are based on rough guesses on my part.
As a base rule of thumb, I usually use the following formulas for estimating pages/sec when it’s a public web site, or the interface is used by the general public (pay booths etc.):
Average Concurrent Users (ACU) = Total Expected User Base (all potential users) * 1%
Average Concurrent Pages (ACP) = ACU * 1%
So I can only speculate, but Finland has 5.3 million people, of whom the 15–65-year-olds make up ~3.5 million, 1% of which is 35,000 potential concurrent people wanting to buy tickets. I assume that about 1/3 of these people use the trains, and about 1/3 of those would buy tickets online (there are about 2 million privately registered vehicles in Finland).
So ACU = 3,500,000 × 0.01 × 0.33 × 0.33 ≈ 3,811
ACP = 3,811 × 0.01 ≈ 38 pages/sec (1% of the concurrent online users are clicking at the same moment)
These numbers may seem high to some, but remember the potential user base really is 3.5 million users.
In comparison, working backwards from the observed 0.55 pages/sec: ACU = 0.55/0.01 = 55; undoing the two 1/3 factors gives 55/0.33/0.33 ≈ 505; and the implied total user base is 505/0.01 ≈ 50,505, far below the real potential user base of 3.5 million.
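Here is the whole rule of thumb, forward and backward, as a small Python sketch (the 1% factors are my rule of thumb, and the two 1/3 shares are my own guesses, not measured figures):

```python
# Rule-of-thumb load estimate: 1% of the user base is concurrent,
# and 1% of those are loading a page at any given instant.
# The 1/3 shares (train users, online buyers) are assumptions.
USER_BASE = 3_500_000
TRAIN_SHARE = 0.33
ONLINE_BUY_SHARE = 0.33

acu = USER_BASE * 0.01 * TRAIN_SHARE * ONLINE_BUY_SHARE  # avg concurrent users
acp = acu * 0.01                                         # avg pages/sec

# Working the formula backwards from the observed ~0.55 pages/sec
# gives the user base the system actually seems to be sized for:
implied_base = 0.55 / 0.01 / TRAIN_SHARE / ONLINE_BUY_SHARE / 0.01

print(round(acu))           # ~3811 concurrent users
print(round(acp))           # ~38 pages/sec
print(round(implied_base))  # ~50505 implied users
```

Running the same formula in both directions makes the mismatch obvious: the observed traffic corresponds to a user base roughly 70 times smaller than the real one.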
My way of estimating the average load probably gives a somewhat higher result than theirs … but remember, these are just estimates 🙂
So how much more should the system be able to handle at peak hour? This is a really difficult question and depends entirely on the system and how it’s used. I estimate that an online train-ticket system should be able to withstand a 30-fold short-term peak, since some seasonal tickets go on sale on specific dates (we now know that the actual peak load was at least 20 times normal, and the system crashed after that).
So my estimation would be (worst case scenarios):
Avg load = ~38 pages/sec
Peak load = ~1140 pages/sec
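The worst-case numbers above follow directly from the average estimate and the assumed 30× peak factor:

```python
# Worst-case sizing: the ~38 pages/sec average estimate from above,
# multiplied by an assumed 30x short-term seasonal on-sale peak.
AVG_PAGES_PER_SEC = 38
PEAK_FACTOR = 30  # assumed peak multiplier

peak_pages_per_sec = AVG_PAGES_PER_SEC * PEAK_FACTOR
print(peak_pages_per_sec)  # 1140
```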
And the Solution?
The only real solution is to get specialized professional help from the real pros: architects, performance testers, and test managers who know what they are doing. It may cost more than doing it in-house or with a large consultant vendor, but when a go-live launch ends as described above, the money spent on pros would have been far less than the money now needed to roll back to the old system in production, retest the new system in QA, find and fix the issues, and perhaps even redesign parts or all of how it works internally.
At the very minimum the people doing the specs should get the pros to help out during the specification phase.
And where to get the professional help, especially if the large consultant vendors don’t seem to be able to provide it?
Answer: Find smaller specialized consultant companies that are experts in the field – My company Celarius is one such company 🙂
If you read this far, then perhaps you would like to comment: How would YOU have estimated the load?