Codestock 2012

This past weekend I was given the opportunity to speak about CouchDB at Codestock in Knoxville, TN. I've given this talk a few times, but this is the first time I've attempted to record it. I've pulled out a 10 minute clip where we walk through storing a fast food order in a relational database and then storing the same order in a document database. The video is rough because all I had was my pocket camcorder.

CouchDB Bucket Demo, Codestock 2012 from digitalbush on Vimeo.

Also, here are the slides for the whole talk.

The sample code for the note taking app and map/reduce are in this repository. The wikipedia demo can be found in this repository. I'm still trying to get my legs with this whole speaking thing, so your feedback is much appreciated. Codestock was a blast and I hope to go back next year!

devLink 2011

I was lucky enough to get subbed in at the last minute to give my CouchDB talk at devLink this past week. This was my first time speaking at a technical conference and it turned out to be a great experience. I did the same bucket demo that worked so well at the user group, and it was just as messy and just as awesome as before.

I'm guessing there were 40-50 people there, and I had 30 people fill out response forms. The audience was really engaged and asked a lot of good questions. The feedback was positive, so that's a bit of a relief. I really hope everyone enjoyed the talk as much as I enjoyed giving it.

The slides from my talk can be downloaded here. The sample app I demoed can be found on github (https://github.com/digitalBush/couch-samples) and the CouchDB/Wikipedia demo is also up on github (https://github.com/digitalBush/Couchipedia).

CouchDB Talk at Nashville .NET User Group

I'm excited to be giving my first technology talk next week. I'll be presenting CouchDB and all of its awesomeness to the Nashville .NET User Group. With this being my first time presenting a technology topic, I'm sure it will suck big time. You have to start somewhere, right?  :)

The talk will be a gentle introduction to CouchDB and how it compares/contrasts with relational databases. I'll be showing off some sample .NET code that talks to CouchDB. My goal isn't to say mean things about relational technologies; I just want people to consider the alternatives. There is a lot of cool stuff out there to store data that doesn't have SQL in its name.

If you are in the Nashville area, come watch me make an ass out of myself, er, speak next Thursday.

CouchDB Bulk Load Performance

Yesterday I wrote a very long (sorry!) post about my technique for bulk loading data into CouchDB. I didn't want to make that post any longer, so today I'm going to talk about how well the bulk loading performs. All of these numbers come from my machine, which is a Core2 Quad Q9000 @ 2 GHz with 8GB RAM and a single 7200 RPM drive.

As you may recall, the main problem I had was getting too much stuff in memory and having everything blow out with an out of memory exception. The solution I posted yesterday keeps memory under control and seems to keep all four processors busy. Here's a screenshot of task manager while it is running.

Armed with a stopwatch and the CouchDB web interface, I did some crude timings.  I may go back at some point and wrap my loader with some timing code so that I can generate some minute-by-minute graphs, but this will do for now.
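If I do get around to it, I imagine the wrapper will look something like this rough sketch: it just polls CouchDB's database info (a GET on the database URL returns doc_count and disk_size) once a minute while the loader from yesterday's post runs. The URLs and file path here are placeholders.

    // Rough sketch only; the URLs and the dump path are placeholders.
    // Needs System, System.Diagnostics, System.Net and System.Threading.
    var watch = Stopwatch.StartNew();

    // GET /{db} returns database info JSON, including doc_count and disk_size.
    var timer = new Timer(_ => {
        using (var client = new WebClient()) {
            var info = client.DownloadString("http://localhost:5984/wiki");
            Console.WriteLine("{0,5:F1} min  {1}", watch.Elapsed.TotalMinutes, info);
        }
    }, null, TimeSpan.Zero, TimeSpan.FromMinutes(1));

    new BulkLoader("http://localhost:5984/wiki/_bulk_docs").Load(@"C:\data\enwiki.xml");
    timer.Dispose();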

| Elapsed Time (minutes) | Database Size (gigabytes) | Document Count |
|---:|---:|---:|
| 1 | 0.33 | 19,000 |
| 2 | 0.7 | 49,000 |
| 3 | 1.0 | 89,000 |
| 4 | 1.4 | 134,000 |
| 5 | 1.8 | 181,000 |
| 10 | 3.6 | 503,000 |
| 15 | 5.1 | 843,000 |
| 20 | 6.7 | 1,201,000 |
| 25 | 7.9 | 1,473,000 |
| 30 | 9.1 | 1,759,000 |

I was really impressed with the numbers! After 30 minutes, I was averaging 977 documents/second (1,759,000 documents over 1,800 seconds) and 5.2 megabytes/second. Keep in mind this is all running locally on my machine, but the numbers sure are encouraging.

Throttling Asynchronous Tasks

I was recently tasked with load testing CouchDB. We have been using the document store very heavily for several new projects and we really needed to understand what it was capable of. More importantly, we needed to try and identify where it starts to fall apart. Knowing the performance characteristics of the software you are using is vital when making decisions about scaling.

The first step was to identify a large data set. I started to go grab the Stack Overflow data when a friend of mine pointed me to Wikipedia. Now this is a data set to toy with. It's an XML file containing the more than 9.5 million pages that make up Wikipedia. Let that soak in for a minute. Unzipped it weighs in at 25.4GB, and I'm sure it has grown since I downloaded it.

Time to play!
The general structure of the data looks like this:

    using System;              // DateTime
    using Newtonsoft.Json;     // JsonProperty

    public class Page {
        [JsonProperty("_id")] // serialized as the CouchDB document id
        public string Id { get; set; }
        public string Title { get; set; }
        public string Redirect { get; set; }
        public Revision Revision { get; set; }
    }

    public class Revision {
        public string Id { get; set; }
        public DateTime Timestamp { get; set; }
        public Contributor Contributor { get; set; }
        public string Minor { get; set; }
        public string Comment { get; set; }
        public string Text { get; set; }
    }

    public class Contributor {
        public string Username { get; set; }
        public string Id { get; set; }
        public string Ip { get; set; }
    }

After identifying the structure, my next step was to actually parse the XML file. After a few minutes of banging my head against the wall, I ended up with this. I use an XmlReader to scan forward through the file and the LINQ to XML bits to break off the page elements.

    using System.Collections.Generic;
    using System.IO;
    using System.Xml;
    using System.Xml.Linq;

    public class WikiFileParser {
        string fileName;

        public WikiFileParser(string fileName){
            this.fileName = fileName;
        }

        public IEnumerable<XElement> GetPages() {
            // Stream the file so the whole dump never has to fit in memory.
            using (var file = new StreamReader(fileName))
            using (var reader = XmlReader.Create(file)) {
                reader.MoveToContent();

                while (reader.Read()) {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "page"){
                        // Materialize just this <page> element into LINQ to XML.
                        var x = XNode.ReadFrom(reader) as XElement;
                        if (x != null)
                            yield return x;
                    }
                }
            }
        }
    }

The reason I chose this method is that I needed to keep a minimal amount of data in memory, and I wanted the convenience of the LINQ to XML API to populate my objects.
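As a quick sanity check, something like this prints the first few page titles without ever loading the whole dump (the file path is just a placeholder, and the namespace string is the same export-0.4 one that shows up again below):

    // Quick sanity check; the dump path is a placeholder.
    // Needs System, System.Linq and System.Xml.Linq.
    var parser = new WikiFileParser(@"C:\data\enwiki-pages-articles.xml");
    foreach (var page in parser.GetPages().Take(3)) {
        var title = page.Element("{http://www.mediawiki.org/xml/export-0.4/}title");
        Console.WriteLine(title == null ? "(no title)" : title.Value);
    }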

Before I get to the meat, I need to share a few extension methods:

    public static XElement Named(this XElement elm, string name){
        var newElm = elm.Element("{http://www.mediawiki.org/xml/export-0.4/}" + name);
        if (newElm == null)
            newElm = new XElement("dummy");
        return newElm;
    }

This one is to get around some weird namespace issues that I encountered. If someone knows how I can avoid this, please let me know!

    public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> pages, int count){
        var chunk = new List<T>();
        foreach (var page in pages){
            chunk.Add(page);

            if (chunk.Count == count){
                yield return chunk.ToList();
                chunk.Clear();
            }
        }

        // Emit the final partial batch, but skip it if the source divided evenly.
        if (chunk.Count > 0)
            yield return chunk.ToList();
    }

This one is kind of interesting. I'm using this to batch up single entities into clumps so that I can perform bulk operations.
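Just as a toy illustration, here's Chunk batching up a plain sequence of numbers:

    // Prints 1,2,3 then 4,5,6 then 7,8,9 then 10 (four batches, the last one partial).
    foreach (var batch in Enumerable.Range(1, 10).Chunk(3))
        Console.WriteLine(string.Join(",", batch));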

Finally, on to the meat!
I'll start by posting the code and then I'll explain each method in detail.

    public class BulkLoader{
        string uri;
        public BulkLoader(string uri) { this.uri = uri; }

        public void Load(string filename){
            Action<IEnumerable<XElement>> saveAction = Save;
            var file = new WikiFileParser(filename);

            var workers = file
                .GetPages()
                .Chunk(1000)
                .Select(x => saveAction.BeginInvoke(x, null, null))
                .Aggregate(new Queue<IAsyncResult>(),
                           (queue, item) =>{
                               queue.Enqueue(item);
                               if (queue.Count > 5)
                                   queue.Dequeue().AsyncWaitHandle.WaitOne();
                               return queue;
                           });

            //Wait for the last bit to finish
            workers.All(x => x.AsyncWaitHandle.WaitOne());
        }

        void Save(IEnumerable<XElement> elms){
            var json = new { docs = elms.Select(x => MakePage(x)) }.ToJson(false);
            var request = WebRequest.Create(uri);
            request.Method = "POST";
            request.Timeout = 90000;

            var bytes = UTF8Encoding.UTF8.GetBytes(json);
            request.ContentType = "application/json; charset=utf-8";
            request.ContentLength = bytes.Length;

            using (var writer = request.GetRequestStream()){
                writer.Write(bytes, 0, bytes.Length);
            }

            // Close the request stream before asking for the response, then discard it.
            request.GetResponse().Close();
        }

        Page MakePage(XElement x){
            var rev = x.Named("revision");
            var who = rev.Named("contributor");
            return new Page(){
                Title = x.Named("title").Value,
                Redirect = x.Named("redirect").Value,
                Id = x.Named("title").Value,
                Revision = new Revision(){
                    Id = rev.Named("id").Value,
                    Timestamp = Convert.ToDateTime(rev.Named("timestamp").Value),
                    Contributor = new Contributor(){
                        Id = who.Named("id").Value,
                        Username = who.Named("username").Value,
                        Ip = who.Named("ip").Value
                    },
                    Minor = rev.Named("minor").Value,
                    Comment = rev.Named("comment").Value,
                    Text = rev.Named("text").Value,
                }
            };
        }
    }

MakePage is responsible for turning the XML into a plain object. Here is where I needed that funky extension method to keep from typing the namespace over and over. There's not much to see here.

Save is responsible for persisting the objects to CouchDB. Alex Robson will roll his eyes when he sees that part. He has written a tremendously awesome .NET CouchDB API called Relax, which you can find on github. The reason I decided not to use his API is that I was trying to eke out every bit of speed that I could find.
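One detail I glossed over: the ToJson extension that Save calls isn't shown in this post. A minimal stand-in built on Json.NET would look something like this, assuming the bool just toggles indentation:

    // Hypothetical stand-in for the ToJson extension used in Save.
    // Assumes Json.NET and that the bool controls indentation.
    using Newtonsoft.Json;

    public static class JsonExtensions {
        public static string ToJson(this object value, bool indented){
            return JsonConvert.SerializeObject(value,
                indented ? Formatting.Indented : Formatting.None);
        }
    }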

Load is where the magic happens. This is the part that tripped me up for quite a while. At first I was using Parallel.ForEach and I kept running into out of memory exceptions. It wasn't until Alex put up this blog post that I saw where I went wrong. I used a similar approach, but with plain old IEnumerable instead of IObservable / Reactive Extensions.

Then, Alex decided to one-up me the other night before I even finished writing this post. You see, the problem with the original solution is that it would collect 5 asynchronous tasks and then wait for them all to finish before starting 5 more. My implementation suffered the same problem.

Sing with me: "Anything Rx can do, IEnumerable can do with more coooooode." I'm sorry that I just put you through that; my musical skills are definitely lacking. I really should get back to explaining the code now. I start by looking at the stream of pages. I collect 1000 of them and then fire off an asynchronous invoke of the batch save I described above. The result is a handle to the asynchronous task, which I can use to track its status. I then aggregate those handles into a queue; once the queue holds more than 5 tasks, I dequeue the oldest one and wait for it to finish before moving on.

The result? I can keep 5 tasks running constantly, always pumping data. Since the data set is rather large and I'm using fairly good sized batches, this produces quite a bit of memory pressure. If you want to try this at home with less memory, back down the batch size.
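For reference, kicking the whole thing off looks roughly like this. The database URL and file path are placeholders; the POST target is CouchDB's bulk insert endpoint, /{db}/_bulk_docs:

    // Placeholders for the database URL and the dump location.
    var loader = new BulkLoader("http://localhost:5984/wiki/_bulk_docs");
    loader.Load(@"C:\data\enwiki-pages-articles.xml");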

This was a fun task to do. Since I've already lost most readers somewhere around 15 paragraphs back, I'll save the performance numbers for another post.