Size Matters: Small User Stories

I've recently joined a team which is in the process of adopting agile/lean methodologies.  This has been somewhat of a challenge since the rest of the organization seems to follow a waterfall approach.  To the rest of the company, we're just a bunch of outlaws slinging software, and from what I understand it's been difficult to get buy-in. We're working towards it, and that's all that matters to me.

I came from a shop which was doing this process quite well.  Prior to that I was in a place where it was pants on fire all the time.  Given the choice, I much prefer the agile process done well. Software got churned out faster and there were fewer bugs.  Everyone seemed to have a firm grip on a wide variety of the product, and there were fewer development silos.

"I want a thing that does a thing like this other thing but different, ya know?"
First and foremost: requirements specifications are important.  The place that had no process was pure insanity.  No one knew what was going on because they didn't really know what was being built.  That lack of knowledge was due to very broad and vague requirements. Because the requirements were never specified at any granular level, communication was poor and a lot of time was wasted re-explaining things.

It sounded so simple when you explained it forever ago...
Big stories are almost as bad as not having any stories.  It's hard to estimate a big story because there is a lot of unknown and a lot of room for change.  The surface area is too large.  By the time you've iterated a big story, the scope is likely to have changed quite significantly from the original simple explanation presented as an afterthought at the end of that meeting you had 3 months ago.  Not only that, but from the outside your progress will seem very slow and your early time estimates will be way off.

What is the smallest deliverable I can provide and still add functionality?
When you've narrowed down your efforts to a very small scope, you have removed surface area for change in the time that it takes you to iterate. That's not to say that feature won't change, because it will.  Once the sprint is done, it's out of our minds and we're moving on.  Change building on existing functionality is good because it builds a better feature.  Change in the middle of implementing functionality is just frustrating as hell.  You just wrote something that no one will ever see.  If they didn't see it, then did it even happen?

That feature demo is closure for you and for the people who sign your paycheck.  They know what you've been up to because they saw it with their own eyes.  They then consciously made a decision to alter direction on something they witnessed work.  Changing something that exists is a lot harder to swallow for a business than changing something that doesn't. I find myself not arguing for features to be implemented in a certain way anymore because I have less emotional attachment to them once they've been seen.

The other good thing about demoing the smaller deliverables is that your perceived velocity is a lot higher. This is a good side effect because the last thing I want my boss questioning is what I've been doing. For me it also helps keep morale higher because I can get something done fast, get a small pat on the back and then move on.  I stop worrying about where things are going and start worrying about getting the next piece done.

For me, I think the smaller stories provide real velocity gains and not just the perception of such.  I spend less time trying to figure out how the pieces fit together in advance.  Instead, I worry more about how this feature fits into what already exists.  If I hit a sticking point, then I know that I need to do some refactoring.  Refactoring code that exists and has tests is much easier than pre-planning all of the what-ifs.

This is all very new to me still.  It's taken me a while to detox from having daily fires to put out. I'm actually able to focus on building better software instead of running around the office with my arms flailing above my head.  As I'm getting to spend more time in agile/lean processes, I'm starting to see big gains in my productivity and mental health. :)

One Year Ago

Wow.  I can't believe this, but it has been a full year since my son Hudson was born.  A lot has changed since then.  I've changed jobs twice, we've moved to a new house, and my wife is now staying home to take care of Hudson. From a tech standpoint, I've given up quite a bit of my non-work time to spend time with my family.  That means less frequent blog posts and less work on my open source projects, but I think it's worth it.  He's only going to be this age once.

The other day I was looking back at his early pictures and I can't believe how much he's grown.  It's just an incredible thing.  Those of you with kids know the feeling I'm experiencing right now. For the rest of you, saying that I'm awestruck would be an understatement. The time has flown by so fast, and I'm enjoying my opportunity to be a father.

The Importance of Liquidity

Not too long ago I blogged about how I found a job in an unconventional manner. The dev team I ended up working with was top notch. I was able to work with them for two months and I learned so much in such a short amount of time. Unfortunately the company that was supporting said dev team was a little less than stable. I'm happy to say that I've moved on and have found employment elsewhere, but that's not what I want to talk about tonight.

Getting booted on my ass revealed how well our family had prepared for tough times.  My wife and I have made it a point to live within our means and save money for a rainy day.  While people around us have been buying fancy new cars and big homes filled with a lot of toys, we lived in our small house and drove older vehicles to save money. When our son, Hudson, came into the world, we decided to upgrade homes, but still managed to be reasonable with our home purchase.

We bought a new home and decision time came.  Do I put a big fat down payment on the new home to keep the monthly payments low or just take advantage of the super low interest rates our shitty economy has bestowed upon us?  The problem with the big fat down payment is that I would have to give up a large chunk of our savings.  In the end I compromised and met somewhere in the middle.  I felt confident that I had enough remaining for my family to survive a while should bad things happen.

Bad things happened.
About a month prior to my previous employer's implosion, we decided that my wife should stay home with our child.  This was huge for both of us.  My wife has worked since she was 16 and staying at home was a huge lifestyle change for her.  For me, it meant some added pressure to provide for my family. We were up for it and took the plunge. Naturally, now that we're solely dependent on my income is when I would get laid off.  That's just the way my world works.

Is my World Ending?
As it turns out, losing my job wasn't the end of the world.  We had at least 6 months of funds to float us until I could find more work.  Granted, that's not 6 months of eating steaks every night, but it's still 6 months of mortgage, utilities, health insurance, groceries, etc.  With that under my belt, I was able to go out and look for jobs confident that I didn't have to take the first thing that came my way.  I was able to pick where I wanted to go so that I could do what's best for my family.

This whole experience has just validated my theories about how our family's finances should be managed. I feel good about how things have turned out. Let my situation serve as an example for you. I would say that you need at least 3 months of money available "just in case." You just never know when the FBI and IRS will pay your company a visit.


CouchDB Bulk Load Performance

Yesterday I wrote a very long (sorry!) post about my technique for bulk loading data into CouchDB. I didn't want to make that post any longer, so today I'm going to talk about how well the bulk loading performs. All of these numbers come from my machine, which is a Core2 Quad Q9000 @ 2 GHz with 8 GB RAM and a single 7200 RPM drive.

As you may recall, the main problem I had was getting too much stuff in memory and having everything blow out with an out of memory exception. The solution I posted yesterday keeps memory under control and seems to keep all four processors busy. Here's a screenshot of task manager while it is running.

Armed with a stopwatch and the CouchDB web interface, I did some crude timings.  I may go back at some point and wrap my loader with some timing code so that I can generate some minute-by-minute graphs, but this will do for now.

Elapsed Time (minutes)   Database Size (gigabytes)   Document Count
1                        0.33                        19,000
2                        0.7                         49,000
3                        1.0                         89,000
4                        1.4                         134,000
5                        1.8                         181,000
10                       3.6                         503,000
15                       5.1                         843,000
20                       6.7                         1,201,000
25                       7.9                         1,473,000
30                       9.1                         1,759,000

I was really impressed with the numbers! After 30 minutes, I was averaging 977 documents/second and 5.2 megabytes/second. Keep in mind this is all running locally on my machine, but the numbers sure are encouraging.
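For anyone who wants to check the averages, here is a small sketch of the arithmetic behind them, using the 30 minute row of the table. The class and method names are made up for this example, and it assumes 1 GB is counted as 1024 MB, which is what lands on roughly 5.2.

```csharp
using System;

public static class Throughput {
    // Average documents per second over the elapsed time.
    public static double DocsPerSecond(long documents, double minutes) {
        return documents / (minutes * 60.0);
    }

    // Average megabytes per second, treating 1 GB as 1024 MB (an assumption).
    public static double MegabytesPerSecond(double gigabytes, double minutes) {
        return gigabytes * 1024.0 / (minutes * 60.0);
    }
}
```

Calling `Throughput.DocsPerSecond(1759000, 30)` gives about 977.2 and `Throughput.MegabytesPerSecond(9.1, 30)` gives about 5.18.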

Throttling Asynchronous Tasks

I was recently tasked with load testing CouchDB. We have been using the document store very heavily for several new projects and we really needed to understand what it was capable of. More importantly, we needed to try and identify where it starts to fall apart. Knowing the performance characteristics of the software you are using is vital when making decisions about scaling.

The first step was to identify a large data set. I started to go grab the Stack Overflow data when a friend of mine pointed me to Wikipedia. Now this is a set of data to toy with. It's an xml file which contains the more than 9.5 million pages which make up Wikipedia. Let that soak in for a minute. Unzipped it weighs in at 25.4GB and I'm sure that it has grown since I last downloaded it.

Time to play!
The general structure of the data looks like this:

    public class Page {
        [JsonProperty("_id")]
        public string Id { get; set; }
        public string Title { get; set; }
        public string Redirect { get; set; }
        public Revision Revision {get;set;}
    }

    public class Revision {
        public string Id { get; set; }
        public DateTime Timestamp { get; set; }
        public Contributor Contributor {get;set;}
        public string Minor {get;set;}
        public string Comment { get; set; }
        public string Text {get;set;}
    }

    public class Contributor {
        public string Username { get; set; }
        public string Id { get; set; }
        public string Ip { get; set; }
    }

After identifying the structure, my next step was to actually parse the xml file. After a few minutes of banging my head against the wall, I ended up at this. I use an XmlReader to scan forward through the file and use the LINQ to XML bits to break off the page elements.

    public class WikiFileParser {
        string fileName;

        public WikiFileParser(string fileName){
            this.fileName = fileName;
        }

        public IEnumerable<XElement> GetPages() {
            // Dispose the file and the reader once enumeration finishes
            using (var file = new StreamReader(fileName))
            using (var reader = XmlReader.Create(file)) {
                reader.MoveToContent();

                while (!reader.EOF) {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "page"){
                        // ReadFrom leaves the reader on the node after </page>,
                        // so calling Read() here as well would skip that node
                        XElement x = XNode.ReadFrom(reader) as XElement;
                        if (x != null)
                            yield return x;
                    }
                    else {
                        reader.Read();
                    }
                }
            }
        }
    }

The reason I chose this method is that I needed to keep a minimal amount of data in memory and I wanted to use the convenience of the LINQ to XML api to populate my objects.
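To see the streaming idea without a 25GB file, here is a minimal sketch that reads from any TextReader. The names `PageStream` and `Elements` are made up for this example. One thing it is careful about: XNode.ReadFrom already moves the reader past the element it consumed, so the loop only calls Read() when it did not just consume a page. Otherwise back-to-back elements with no whitespace between them could be skipped.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class PageStream {
    // Streams every element with the given local name, holding only one
    // element in memory at a time.
    public static IEnumerable<XElement> Elements(TextReader input, string localName) {
        using (var reader = XmlReader.Create(input)) {
            reader.MoveToContent();
            while (!reader.EOF) {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == localName) {
                    // ReadFrom consumes the element and leaves the reader on the
                    // node that follows it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                } else {
                    reader.Read();
                }
            }
        }
    }
}
```

Feeding it a StringReader over a tiny document with three adjacent page elements yields all three, one at a time.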

Before I get to the meat, I need to share a few extension methods:

    public static XElement Named(this XElement elm, string name){
        var newElm = elm.Element("{http://www.mediawiki.org/xml/export-0.4/}" + name);
        if (newElm == null)
            newElm = new XElement("dummy");
        return newElm;
    }

This one is to get around some weird namespace issues that I encountered. If someone knows how I can avoid this, please let me know!
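One possible tidy-up, offered as a sketch rather than a definitive answer: the export file declares a default namespace, so every element name is namespace-qualified and LINQ to XML wants the namespace as part of the name. Holding the URI in an XNamespace lets the `+` operator build the qualified name. The class name `WikiXml` is made up for this example, and the URI is the one from the method above.

```csharp
using System.Xml.Linq;

public static class WikiXml {
    // Namespace URI taken from the extension method above.
    static readonly XNamespace Ns = "http://www.mediawiki.org/xml/export-0.4/";

    // Returns the named child in the MediaWiki namespace, or an empty
    // placeholder element so that .Value is safe to call on the result.
    public static XElement Named(this XElement elm, string name) {
        return elm.Element(Ns + name) ?? new XElement("dummy");
    }
}
```

This behaves the same as the string concatenation version. It just keeps the namespace in one place.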

    public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> pages, int count){
        List<T> chunk = new List<T>();
        foreach (var page in pages){
            chunk.Add(page);

            if (chunk.Count == count){
                yield return chunk.ToList();
                chunk.Clear();
            }
        }
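        // Only emit the tail if something is left over, otherwise an input
        // that divides evenly would end with an empty batch
        if (chunk.Count > 0)
            yield return chunk.ToList();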
    }

This one is kind of interesting. I'm using this to batch up single entities into clumps so that I can perform bulk operations.
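Here is a small standalone sketch of the same batching idea that can be tried on plain integers. It uses the made-up name `InBatchesOf` so that it cannot clash with the Enumerable.Chunk method that ships with .NET 6 and later, and it skips the trailing batch when nothing is left over.

```csharp
using System.Collections.Generic;

public static class Batching {
    // Groups a lazy sequence into lists of at most `count` items each.
    public static IEnumerable<List<T>> InBatchesOf<T>(this IEnumerable<T> items, int count) {
        var batch = new List<T>(count);
        foreach (var item in items) {
            batch.Add(item);
            if (batch.Count == count) {
                yield return batch;
                batch = new List<T>(count);   // fresh list; the caller owns the old one
            }
        }
        if (batch.Count > 0)                  // emit a partial tail, skip an empty one
            yield return batch;
    }
}
```

Seven items in batches of three come out as batches of 3, 3 and 1, and an empty input produces no batches at all. Because it is an iterator, only one batch is built at a time, which is what keeps memory in check on a huge input.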

Finally, on to the meat!
I'll start by posting the code and then I'll explain each method in detail.

    public class BulkLoader{
        string uri;
        public BulkLoader(string uri) { this.uri = uri; }

        public void Load(string filename){
            Action<IEnumerable<XElement>> saveAction = Save;
            var file = new WikiFileParser(filename);

            var workers = file
                .GetPages()
                .Chunk(1000)
                .Select(x => saveAction.BeginInvoke(x, null, null))
                .Aggregate(new Queue<IAsyncResult>(),
                           (queue, item) =>{
                               queue.Enqueue(item);
                               if (queue.Count > 5)
                                   queue.Dequeue().AsyncWaitHandle.WaitOne();
                               return queue;
                           });

            //Wait for the last bit to finish
            workers.All(x => x.AsyncWaitHandle.WaitOne());
        }

        void Save(IEnumerable<XElement> elms){
            var json = new { docs = elms.Select(x => MakePage(x)) }.ToJson(false);
            var request = WebRequest.Create(uri);
            request.Method = "POST";
            request.Timeout = 90000;

            var bytes = UTF8Encoding.UTF8.GetBytes(json);
            request.ContentType = "application/json; charset=utf-8";
            request.ContentLength = bytes.Length;

            using (var writer = request.GetRequestStream()){
                writer.Write(bytes, 0, bytes.Length);
            }
            // Close the request stream before asking for the response
            request.GetResponse().Close();
        }

        Page MakePage(XElement x){
            var rev = x.Named("revision");
            var who = rev.Named("contributor");
            return new Page(){
                Title = x.Named("title").Value,
                Redirect = x.Named("redirect").Value,
                Id = x.Named("title").Value,
                Revision = new Revision(){
                    Id = rev.Named("id").Value,
                    Timestamp = Convert.ToDateTime(rev.Named("timestamp").Value),
                    Contributor = new Contributor(){
                        Id = who.Named("id").Value,
                        Username = who.Named("username").Value,
                        Ip = who.Named("ip").Value
                    },
                    Minor = rev.Named("minor").Value,
                    Comment = rev.Named("comment").Value,
                    Text = rev.Named("text").Value,
                }
            };
        }
    }

MakePage is responsible for turning the xml into a plain object. Here is where I needed that funky extension method to keep from typing the namespace over and over. There's not much to see here.

Save is responsible for persisting the object to CouchDB. Alex Robson will roll his eyes when he sees that part. He has written a tremendously awesome .NET CouchDB API called Relax which you can find on GitHub. The reason I decided not to use his API is that I was trying to eke out every bit of speed that I could find.

Load is where the magic happens. This is the part that tripped me up for quite a while. At first I was using Parallel.ForEach and I kept running into out of memory exceptions. It wasn't until Alex put up this blog post that I saw where I went wrong. I used a similar approach, but with plain old IEnumerable instead of IObservable / Reactive Extensions.

Then, Alex decided to one up me the other night before I even finished writing this post. You see, the problem with the original solution is that it would collect 5 asynchronous tasks and then wait for them all to finish before starting 5 more. My implementation suffered the same problem.

Sing with me: "Anything Rx can do, IEnumerable can do with more coooooode." I'm sorry that I just put you through that; my musical skills are definitely lacking. I really should get back to explaining code now. So, you see, I start by looking at the stream of pages. I collect 1000 and then I fire off an asynchronous invoke of the batch save I describe above. The result of that is a handle to the asynchronous task which I can then use to track its status. I then aggregate that into a queue, where I collect up to 5 tasks and then start dequeuing the oldest task and waiting until it's finished.

The result? I can keep 5 tasks running constantly always pumping data. Since the data set is rather large and I'm using fairly good sized batches, this produces quite a bit of memory pressure. If you want to try this at home with not so much memory, then back down the batch size.
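The queue trick also works outside of this loader. Below is a minimal sketch of the same sliding window, with two swaps that are mine and not part of the loader above: it uses Task.Run instead of a delegate's BeginInvoke (which is not supported on .NET Core and later), and it returns the peak number of tasks it saw running so the limit can be checked. The names `Throttle` and `Run` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Throttle {
    // Runs work(item) for every item on the thread pool while keeping at most
    // maxInFlight tasks outstanding. Returns the highest concurrency observed.
    public static int Run<T>(IEnumerable<T> items, int maxInFlight, Action<T> work) {
        var inFlight = new Queue<Task>();
        var gate = new object();
        int running = 0, peak = 0;

        foreach (var item in items) {
            var current = item;
            inFlight.Enqueue(Task.Run(() => {
                lock (gate) { running++; if (running > peak) peak = running; }
                try { work(current); }
                finally { lock (gate) { running--; } }
            }));

            // Window is full: block on the oldest task before pulling more input.
            if (inFlight.Count >= maxInFlight)
                inFlight.Dequeue().Wait();
        }

        // Wait for the last bit to finish.
        while (inFlight.Count > 0)
            inFlight.Dequeue().Wait();
        return peak;
    }
}
```

Since the next item is only pulled from the sequence after the oldest task has finished, a lazy source such as the chunked page stream never gets further ahead than the window allows. A side effect of waiting on Task objects is that a failure inside the work surfaces as an AggregateException from Wait() instead of going unnoticed.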

This was a fun task to do. Since I've already lost most readers somewhere about 15 paragraphs back, I'll save the performance for another post.
