Data Exchange Basics

In the past month I've run into two clients that just don't understand how to take in outside data. I really didn't think that this post was necessary until I had these recent conversations. I thought I would just lay it out there how I feel that data exchange works. It's really quite simple.

First, I should describe my background. I work for a national PPO healthcare network. We directly contract with medical providers and also serve as a PPO aggregator taking in data from several sources. We take data from other networks, wrap it up alongside our own, and then redistribute the whole shebang to clients. Multiple gigabytes of data flow into and out of systems every month.

Taking data from other sources can be difficult. The burden is on the recipient to "scrub" the data and make it fit into your own formats and standards. The only data that you can really trust is your own. There is only so much that you can expect the sender to do as they already have the data in the format that they need. Basically the sender needs to do two things in order to make a successful exchange possible.

  1. Provide a unique record id. This is absolutely critical.  Some clients I've dealt with don't have truly unique record level identifiers.  Sometimes a composite of various fields can be used, which is perfectly acceptable but not exactly ideal.  When there is no identifier, then it becomes really hard to know when data changes.
  2. Provide a consistent and reliable feed of data. The data that is being provided needs to updated regularly and be in a consistent format.  A field that you said was numeric can't magically become alphanumeric without something screwing up. The worst case scenario is when data gets changed in a subtle way that somehow gets misinterpreted on the other end without someone catching it.  When you are updating millions records per month, you can't eyeball very much of it.

That's basically it for the sender.  I told you the burden was on the recipient.  So, you have an outside source of data, now what?

  1. Uniquely identify the source.  When you are looking at a piece of data, you have to know where it came from.  Each source of data into your system (including your own) needs to have a identity.  That identity has to be tagged to the data feed.
  2. Store the source's unique record id beside yours. Before you even start to think about it, no you can't use their identifier as your own.  Furthermore, you shouldn't try to generate some kind of composite key based upon the source and the source's id. You need your own internal id.  Anytime I have a mashup of data, I create an identity (id) for foreign keys and then store the source (source_id) and the source's record identifier (external_id).  External id is always an null-able column with varchar(max).
  3. Reformat the data to fit into your system. Your data standards are not the responsibility of the source.  That bears repeating, your data standards are not the responsibility of the source.  This will mean that you will have to make compromises with the data and sometimes possibly outright reject data.  It's just part of it, but you'd rather have less higher quality data, than a bunch of crappy data.
  4. Notify the source of problems so that they have the opportunity to make corrections. While I don't like it when it happens, clients will approach me from time to time to point out oddities within my own data.  It's going to happen, and it's only because they have a different perspective on the data and may be looking at different things.  More eyeballs on the data means better quality data in the end.
  5. Trust nothing that you are receiving. Whenever the data exchange discussion is initiated, try to get as much information from the source as possible.  What fields are required?  What data types are expected?  Take all of the information you gather and write code to enforce it. The better job you do here goes a long way in catching senders who violate #2 in the list above.
  6. Use the source's naming conventions.  Somewhere you will be doing a mapping from their data to your own.  When you are referring to their data, name it (in code, in a temporary table, wherever) what they call it in their documentation.  This makes debugging stuff later on so much easier.  For the love of all things holy, never ever refer to their data by position within a  file. Whenever their format changes (sometimes it will happen), then you will be up shit creek. I name it what they call it right up to the point that it leaves their data structures and enters my own so that I can explicitly see the mapping.  "my.StateDiscount = their.Amount-their.FsAllowedAmount" goes a long way in bridging communication barriers 10 months down the line when an odd problem shows it's ugly head.

Bonus Tip: I've found that full data exchanges much more user friendly than add/update/delete notifications.  This is from my experience with data that is relatively constant with a lot of random updates, which may not necessarily apply for other type of data (time sensitive data comes to mind real fast here).  With relatively static data, a full exchange provides a few benefits for both sides.  The sender doesn't have to worry about tracking changes.  The recipient doesn't have to worry about processing the data in a certain order.  The latest data is the only data that matters.  It also allows the recipient to easily detect if a record gets orphaned somehow.  With change notifications, there is really a need for an acknowledgement from the recipient so that the sender knows the data was processed and is up to date.  The only downside for the full exchange can be file size, but it's a trade off I find acceptable.

That's about all I have on this subject.  If you have any more tips on exchanging data, I'd love to hear them.  This is a big part of what I'm responsible for every day.  A lot of it is common sense, and a lot of it is just stuff I learned by screwing up.  Dealing with other people's data is never fun.

3 Comments so far

  1. NickV @ June 2nd, 2009

    I used to work for a telecom business that dealt with importing (and sometimes exporting) large volumes of call data from multiple sources and formats. Here are some things to add:

    As a sender:
    1. use a repeatable, automated process that will always generate the same format. Recalling the settings (delimiter, line ending, file type) from memory to export data from Excel is unacceptable. Write, AND FOLLOW, the process if necessary; automate it if possible.
    2. Include meta-data with the exported data if applicable. When it was generated, who generated it, who the intended recipient is, and/or location to find information about the format. If you’re receiving the data 2nd hand and have questions, who do you ask?
    3. Occasionally consume your own data. If you can’t consume it, how can you expect someone else to? Providing the process to consume the data, even if it uses a different language or program, speaks volumes and can reduce questions.

    As a receiver:
    1. Don’t trust the format specification to be accurately followed by the sender. Two senders may interpret one standard differently.
    2. Don’t get mad when the format or data is invalid. Whoever provided the data may be new to the job or may be overworked. Not to mention, your words may come back to haunt you.
    3. Reuse existing import tools. Rolling your own code to import a file can introduce errors, especially when someone else has already done all the hard work. FileHelpers for .NET can handle CSV, quoted fields, fixed width, fixed width with variable record types, data type conversion, and more. Chances are, the pre-built import tools already have efficient fixes for unexpected complex import problems.

  2. NickV @ June 2nd, 2009

    Link to FileHelpers for .NET:
    http://www.filehelpers.com/

  3. josh @ June 2nd, 2009

    @NickV Very good points. Everything we do here is automated for both imports and exports. We also use the FileHelpers library and it’s a great tool.

Leave a reply