Archive for June, 2010

Throttling Asynchronous Tasks

I was recently tasked with load testing CouchDB. We have been using the document store very heavily for several new projects and we really needed to understand what it was capable of. More importantly, we needed to try and identify where it starts to fall apart. Knowing the performance characteristics of the software you are using is vital when making decisions about scaling.

The first step was to identify a large data set. I started to go grab the Stack Overflow data when a friend of mine pointed me to Wikipedia. Now this is a set of data to toy with. It's an XML file containing the more than 9.5 million pages that make up Wikipedia. Let that soak in for a minute. Unzipped, it weighs in at 25.4GB, and I'm sure it has grown since I last downloaded it.

Time to play!
The general structure of the data looks like this:

    public class Page {
        [JsonProperty("_id")]
        public string Id { get; set; }
        public string Title { get; set; }
        public string Redirect { get; set; }
        public Revision Revision {get;set;}
    }

    public class Revision {
        public string Id { get; set; }
        public DateTime Timestamp { get; set; }
        public Contributor Contributor {get;set;}
        public string Minor {get;set;}
        public string Comment { get; set; }
        public string Text {get;set;}
    }

    public class Contributor {
        public string Username { get; set; }
        public string Id { get; set; }
        public string Ip { get; set; }
    }

After identifying the structure, my next step was to actually parse the XML file. After a few minutes of banging my head against the wall, I ended up with this. I use an XmlReader to scan forward through the file and use the LINQ to XML bits to break off the page elements.

    public class WikiFileParser {
        string fileName;

        public WikiFileParser(string fileName){
            this.fileName = fileName;
        }

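        public IEnumerable<XElement> GetPages() {
            // Dispose both readers once the caller finishes (or abandons) the enumeration.
            using (var file = new StreamReader(fileName))
            using (var reader = XmlReader.Create(file)) {
                reader.MoveToContent();

                while (!reader.EOF) {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "page"){
                        // ReadFrom consumes the whole <page> and leaves the reader on the
                        // node that follows it, so don't call Read() again here.
                        XElement x = XNode.ReadFrom(reader) as XElement;
                        if (x != null)
                            yield return x;
                    } else {
                        reader.Read();
                    }
                }
            }
        }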
    }

I chose this method because I needed to keep a minimal amount of data in memory, and I wanted the convenience of the LINQ to XML API to populate my objects.

Before I get to the meat, I need to share a few extension methods:

    public static XElement Named(this XElement elm, string name){
        var newElm = elm.Element("{http://www.mediawiki.org/xml/export-0.4/}" + name);
        if (newElm == null)
            newElm = new XElement("dummy");
        return newElm;
    }

This one is to get around some weird namespace issues that I encountered. If someone knows how I can avoid this, please let me know!
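One possible answer to the namespace question, offered as a sketch rather than something tested against the real dump: the export file presumably declares a default xmlns on its root element, which is why every child element's name comes back namespace-qualified. LINQ to XML lets you borrow the parent's own namespace through Name.Namespace instead of hardcoding the URI. The Child and TitleOf helpers below are hypothetical names made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Hypothetical helper: looks up a child in the same namespace as its parent,
    // so the export version in the URI never has to be typed out.
    public static XElement Child(XElement parent, string name)
    {
        return parent.Element(parent.Name.Namespace + name);
    }

    public static string TitleOf(string pageXml)
    {
        var page = XElement.Parse(pageXml);
        // Casting an XElement to string yields null when the element is missing,
        // which avoids needing a dummy element.
        return (string)Child(page, "title");
    }

    public static void Main()
    {
        var xml = "<page xmlns=\"http://www.mediawiki.org/xml/export-0.4/\">" +
                  "<title>Nashville</title></page>";
        Console.WriteLine(TitleOf(xml)); // prints "Nashville"
    }
}
```

The trade-off is that a missing element now shows up as null instead of an empty string, so callers have to decide what a missing value should mean.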

    public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> pages, int count){
        List<T> chunk = new List<T>();
        foreach (var page in pages){
            chunk.Add(page);

            if (chunk.Count == count){
                yield return chunk.ToList();
                chunk.Clear();
            }
        }
        // Only emit the final partial chunk if it actually contains something.
        if (chunk.Count > 0)
            yield return chunk.ToList();
    }

This one is kind of interesting. I'm using this to batch up single entities into clumps so that I can perform bulk operations.
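To make the batching concrete, here is a small self-contained sketch of the same idea. The name InBatches is hypothetical; I picked it only so the example cannot clash with the Enumerable.Chunk method that ships with .NET 6 and later. It hands out a fresh list per batch rather than copying and clearing one shared list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchDemo
{
    // Hypothetical name, chosen to avoid clashing with Enumerable.Chunk in .NET 6+.
    public static IEnumerable<List<T>> InBatches<T>(this IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size); // fresh list, so earlier batches are never mutated
            }
        }
        if (batch.Count > 0)
            yield return batch; // partial final batch; an empty tail emits nothing
    }

    public static void Main()
    {
        foreach (var batch in Enumerable.Range(1, 7).InBatches(3))
            Console.WriteLine(string.Join(",", batch));
        // prints:
        // 1,2,3
        // 4,5,6
        // 7
    }
}
```

Because the method is an iterator, nothing is read from the source until the caller asks for the next batch, which is what keeps memory use bounded on a file this size.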

Finally, on to the meat!
I'll start by posting the code and then I'll explain each method in detail.

    public class BulkLoader{
        string uri;
        public BulkLoader(string uri) { this.uri = uri; }

        public void Load(string filename){
            Action<IEnumerable<XElement>> saveAction = Save;
            var file = new WikiFileParser(filename);

            var workers = file
                .GetPages()
                .Chunk(1000)
                .Select(x => saveAction.BeginInvoke(x, null, null))
                .Aggregate(new Queue<IAsyncResult>(),
                           (queue, item) =>{
                               queue.Enqueue(item);
                               if (queue.Count > 5)
                                   queue.Dequeue().AsyncWaitHandle.WaitOne();
                               return queue;
                           });

            //Wait for the last bit to finish
            workers.All(x => x.AsyncWaitHandle.WaitOne());
        }

        void Save(IEnumerable<XElement> elms){
            var json = new { docs = elms.Select(x => MakePage(x)) }.ToJson(false);
            var request = WebRequest.Create(uri);
            request.Method = "POST";
            request.Timeout = 90000;

            var bytes = UTF8Encoding.UTF8.GetBytes(json);
            request.ContentType = "application/json; charset=utf-8";
            request.ContentLength = bytes.Length;

            // Close the request stream before asking for the response.
            using (var writer = request.GetRequestStream()){
                writer.Write(bytes, 0, bytes.Length);
            }
            request.GetResponse().Close();
        }

        Page MakePage(XElement x){
            var rev = x.Named("revision");
            var who = rev.Named("contributor");
            return new Page(){
                Title = x.Named("title").Value,
                Redirect = x.Named("redirect").Value,
                Id = x.Named("title").Value,
                Revision = new Revision(){
                    Id = rev.Named("id").Value,
                    Timestamp = Convert.ToDateTime(rev.Named("timestamp").Value),
                    Contributor = new Contributor(){
                        Id = who.Named("id").Value,
                        Username = who.Named("username").Value,
                        Ip = who.Named("ip").Value
                    },
                    Minor = rev.Named("minor").Value,
                    Comment = rev.Named("comment").Value,
                    Text = rev.Named("text").Value,
                }
            };
        }
    }

MakePage is responsible for turning the XML into a plain object. Here is where I needed that funky extension method to keep from typing the namespace over and over. There's not much to see here.

Save is responsible for persisting the object to CouchDB. Alex Robson will roll his eyes when he sees that part. He has written a tremendously awesome .NET CouchDB API called Relax, which you can find on GitHub. I decided not to use his API because I was trying to eke out every bit of speed that I could find.
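For anyone wondering what actually goes over the wire: CouchDB's bulk endpoint (POST to /{db}/_bulk_docs) takes a JSON object with a single docs array, which is the shape Save builds. I am assuming the uri handed to BulkLoader points at that endpoint. The sketch below uses System.Text.Json as a stand-in for the ToJson helper in the post, and Doc is a cut-down, hypothetical version of Page.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical, trimmed-down document type for illustration only.
public class Doc
{
    [JsonPropertyName("_id")] // CouchDB reads the document id from "_id"
    public string Id { get; set; }
    public string Title { get; set; }
}

public static class BulkPayloadDemo
{
    // Wraps the documents in the { "docs": [...] } envelope used for bulk inserts.
    public static string Build(params Doc[] docs)
    {
        return JsonSerializer.Serialize(new { docs });
    }

    public static void Main()
    {
        // Prints a JSON object whose only member is the "docs" array.
        Console.WriteLine(Build(new Doc { Id = "Nashville", Title = "Nashville" }));
    }
}
```

Sending one request per thousand documents is what makes the batching worthwhile, since each HTTP round trip is paid once per batch instead of once per page.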

Load is where the magic happens. This is the part that tripped me up for quite a while. At first I was using Parallel.ForEach and I kept running into out of memory exceptions. It wasn't until Alex put up this blog post that I saw where I went wrong. I used a similar approach, but with plain old IEnumerable instead of IObservable / Reactive Extensions.

Then Alex decided to one-up me the other night, before I even finished writing this post. You see, the problem with the original solution is that it would collect 5 asynchronous tasks and then wait for them all to finish before starting 5 more. My implementation suffered from the same problem.

Sing with me: "Anything Rx can do, IEnumerable can do with more coooooode." I'm sorry that I just put you through that; my musical skills are definitely lacking. I really should get back to explaining code now. So, you see, I start by looking at the stream of pages. I collect 1000 and then I fire off an asynchronous invoke of the batch save I describe above. The result of that is a handle to the asynchronous task, which I can then use to track its status. I then aggregate that into a queue, where I collect up to 5 tasks and then start dequeuing the oldest task and waiting until it's finished.

The result? I can keep 5 tasks running constantly always pumping data. Since the data set is rather large and I'm using fairly good sized batches, this produces quite a bit of memory pressure. If you want to try this at home with not so much memory, then back down the batch size.
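If you want to play with the throttling idea on its own, here is a minimal sketch of the same sliding-window trick. It swaps in Task.Run for the delegate BeginInvoke used above (delegate BeginInvoke is not supported on .NET Core and later), and RunThrottled is a hypothetical name. It returns the peak number of work items it saw running at once so you can check that the limit holds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottleDemo
{
    // Runs `work` for each item, keeping at most `limit` calls in flight.
    // Returns the highest number of calls observed running at the same time.
    public static int RunThrottled<T>(IEnumerable<T> items, int limit, Action<T> work)
    {
        int running = 0, peak = 0;
        var gate = new object();
        var inFlight = new Queue<Task>();

        foreach (var item in items)
        {
            // Make room first: wait for the oldest task before starting another.
            if (inFlight.Count == limit)
                inFlight.Dequeue().Wait();

            var current = item; // copy for the closure
            inFlight.Enqueue(Task.Run(() =>
            {
                lock (gate) { running++; if (running > peak) peak = running; }
                try { work(current); }
                finally { lock (gate) { running--; } }
            }));
        }

        // Wait for the last few tasks to finish.
        Task.WaitAll(inFlight.ToArray());
        return peak;
    }

    public static void Main()
    {
        int peak = RunThrottled(Enumerable.Range(0, 20), 5, i => Thread.Sleep(50));
        Console.WriteLine("peak concurrency: " + peak); // never more than 5
    }
}
```

One caveat with waiting on the oldest task: if that task is slow, the window stalls even though newer tasks have already finished. A SemaphoreSlim that each task releases on completion would avoid that, at the cost of a little more code.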

This was a fun task to do. Since I've already lost most readers somewhere about 15 paragraphs back, I'll save the performance results for another post.

Will Tweet for Food

Some of you may know that I've recently switched jobs. For several months I tried to find just the right job using traditional job search mechanisms and was failing miserably. I got a few interviews, but nothing really clicked.  I didn't want to get stuck in some soul sucking corporate dev job wasting away in a cube while writing COBOL.  Unfortunately that's what I kept finding.

If you want something different, sometimes you have to look in different places. That's what I was thinking when I started seeking out local devs on Twitter in the Nashville area. I started this quest about a year ago and I thought it would be entertaining to share with you how Twitter found me a job.

The first step was to find some Nashville developers. This wasn't as easy as it sounds. You see, you don't just go out and do a search and come up with 20 or 30 people by location. Twitter doesn't quite work that well.  It really started with Elijah Manor.  I stumbled upon him one day from a retweeted tech tweet of his. I saw that he was from Nashville and thought that was neat, so I followed him and started looking over the people he followed. From there I found a few more local developers.

The next step is to wait and watch. This is really just a continuation of step one. The only way to find people to follow is to see how people interact with one another. Some of the Nashville tweeple I had found were duds while others were awesome.  Some of them interacted with other local devs. Jackpot! Over the course of a few months I picked up a few more local people who I thought were interesting.

The third step is to interact.  Twitter doesn't work unless you are producing output. Twitter will let you virtually stalk someone, but I doubt that would land you a job. "Hello sir, you don't know me, but I've been following you for a few months..." That just won't help any. No, you just need to put yourself out there.  Talk to other devs about interesting stuff.  Build relationships.  Try not to be stupid or controversial.  It's that easy.

The fourth step is to wait (again).  At this point you aren't looking for a job, you are waiting for opportunities.  While you are having fun interacting and learning from others just keep an eye open for goodies.  When you see something, go after it and see where it takes you.

Where it took me was two different opportunities. I can't believe it, but not only did Twitter get me in the door with two companies, it got me two, count them, two job offers. I thought both of those opportunities were very good and would have suited me well. Traditional job searches got me in the door, but it was with the wrong companies for my needs. The interviews I got via Twitter were very laid back and natural. They were more of a friendly chat than anything else.

After a couple of brief discussions with Alex Robson and Dave Purdon about how I got hired and why they chose me, I've come to a few conclusions. Alex told me that he felt that using social media allowed them to find like-minded people. I feel the same way about my employer. Being able to interact with potential co-workers well in advance of my hiring showed me which employers treat their employees well. Dave said that they felt comfortable with me because they saw that I loved what I did and used my own time to research and explore software development. From my point of view, I felt that a company that allowed (or even better, encouraged!) their devs to use Twitter would be a good fit. It showed me that they promote openness and are current with trends.

I just want to let everyone know that you have options when it comes to finding a job.  There is certainly more than one way to do it.  The most important thing is to always be looking for opportunities.  Sometimes they might appear in places you aren't expecting.

Slight Adjustments

Tonight I set out to fix a drain in the front bathroom.  This was one of the things on my todo list since we moved into our new home over a month ago.  This tub is where we've been giving Hudson his baths and the lever style drain stopper wasn't completely plugging the drain.  Hudson likes his time in the tub, so naturally the water draining out slowly was a problem.

With me being the person that I am, I went to the hardware store and bought one of those push style drains to replace it.  I installed one on the old house, so I knew this wouldn't be an issue.  I never even investigated why the old one leaked, I just knew I was going to tear the old one out and put in a shiny new one.  I removed the two screws holding the overflow plate (the plate with the lever in the middle) and pulled out the linkage and stopper.

I was about to start removing the drain plate when it hit me: there was a threaded rod on that linkage I had just thrown in the garbage. That would let me move the stopper up and down and adjust how deep it goes into the pipe.  So, for shits and grins I turned the threaded rod out a few turns and stuck it back in. Amazingly it worked. No more leaky drain.

I could have punched myself. I moved the rod maybe an eighth of an inch to fix the problem. That's it, 1/8 of one inch. I had spent $20 on the replacement. I drove a few miles to the hardware store and burned a little gas. I wasted half an hour going to pick up the parts.  I did all of that to solve a problem that was only an eighth of an inch long.

So many times my solution is to find the biggest hammer and beat until the problem goes away. Tonight I was reminded that sometimes I just need to take a moment to see what I have in front of me. The best solution isn't always to rip something out and replace it. Sometimes a little tweak is all that is needed. 

While tonight I am talking about plumbing, this really applies to life in general.  How much of your life have you tried to replace when all you needed to do was make a slight adjustment?  Even worse, how much of your life has made you unhappy and you've just let it happen without a second thought? Maybe a small tweak is all you need to restore a little bit of your sanity.  Think about it.