If you have been following my blog you have probably noticed I haven’t been writing, or in fact doing anything publicly, in the last few months. A lot happened in my personal life (for example, I moved to Santa Clara), but more importantly I was buried in work.
It is weird how I always read about other people working insane hours at startups, but it hadn’t really happened to me. Sure, especially early on in my career I did put in a few all-nighters, but not consistently. This time it was different.
I had been working on a large project for several months, of which I was the main architect and developer, and we finally deployed it into production. And all hell broke loose! Some problems could have been avoided with a better test setup (especially load testing), some came from my inexperience dealing with heavily loaded systems, some we could chalk up to 3rd party software not quite ready for prime time, some to running software on VMs when we should have run on bare hardware, some to bizarre VM freezes (minutes at a time, which is obviously disastrous for a server), and some to lessons learned with the Python GIL and multithreaded applications, plus various other bits and pieces. All of this was made worse by our steady and respectable growth in both the number of customers and the sizes of data sets. I plan to write about some of that in later posts.
Since customers were being affected, and we didn’t see the full scope of problems in advance, I thought I would work extra hours and fix the issues. I was being optimistic, and kept assuming that fixing the most pressing issue that was killing us at the moment would be what got us over the hump. This turned into days, then weeks, then months of working 60-70 hour weeks, 6-7 days a week on 5-6 hours of sleep, fixing issue after issue after issue. Combine this with the move to a new city and things were really crazy for a while. I don’t understand how I did it, given how little time we had to test on many occasions, but I believe I never introduced a catastrophic bug that got rolled into production (if you discount the initial deployment ;). Finally, just before Christmas, things started humming the way I had planned and how they had worked in our tests months before. Of course there is still a lot of work to do to improve things, but that can be addressed at a more sustainable pace.
Luckily the work I was doing was very interesting, or I wouldn’t have been able to do it for such a long period. In retrospect I can finally say that I have personally faced the problem of scaling, and while I’ve always said it is a good problem to have, I now also realize that trying to solve it can easily lead to exhaustion, because the pressure to solve the issues quickly is huge.
While I learned a lot of things and will be able to avoid some of the issues in the future, this also clearly showed what kind of testing we still need to do better. Besides being better for users of the software, it is also better for developer health and sanity…
I am still recovering from that ordeal both mentally and physically, but I have been feeling much better. I do realize I am suffering from some burnout, but I am also starting to get the itch to continue developing my own software, both free and paid. The M2Crypto release has been pending for around 6 months now, and my Android apps are really in need of an upgrade.
I am also months behind in some personal correspondence. So sorry! I will try to get my inbox in order in the next couple of weeks as well.