Archive for June, 2009

What I’ve Learned as an Accidental DBA

Working for a small company has it's advantages and disadvantages.  One of the biggest disadvantages has to be the lack of specialist employees.  I wear many hats at work.  Most days I'm a software developer.  Some days I get to be a network administrator.  When a server is down, then I'm the systems administrator. One day I was the janitor.  Since I'm big and tall, I'm frequently mistaken for a furniture mover.

It's a blessing and a curse.  There is never a dull moment and always something for me to do.  I've acquired many skills out of necessity.  One of those skills has been database administration.  More specifically, Oracle administration.  I've learned to hate the database server.  When something goes wrong in that box of mystery, everything grinds to a halt and people start getting pissed.  So, in no particular order, I'd like to share some nuggets of information I've learned over the years.

  • Cheap RAID cards are a bitch.  I've had the opportunity to witness the carnage that can happen when good RAID goes bad.
  • A fast disk is a fast database.  Unless you can keep your entire database in memory, your data is going to need to be read from disk.  Disk IO is a major bottleneck.
  • RAID is awesome when it works.  What's faster than one fast disk?  More fast disks and a fast RAID controller.  More is better. Just make sure you invest in quality equipment.  Good RAID controllers aren't cheap for a reason.  The expensive ones don't corrupt your data.
  • RAID 10 is better than RAID 5.  The only exception is if you never need to write data with any speed.  Disks are so cheap that it doesn't make any sense to try and save the money.  RAID 10 is just that much faster.
  • Indexes are a double edged sword.  They are necessary for fast queries, but too many will slow inserts and updates.  Also, indexes on large tables should be chosen carefully as the indexes can consume quite a bit of disk.
  • A bad query can bring the database to its knees.
  • Locks protect multiple requests from stomping on top of each other.  They are also good at stopping inserts and updates from happening when one gets stuck.
  • Keeping the statistics up to date makes sure that the database choses an optimal execution plan for retrieving your data.  The database needs to know how your data looks. 
  • Full table scans aren't always bad, but most of the time they are.  The only time they aren't is generally when you are querying a majority of the data in the table.
  • Be careful where you store your LOBs. Storing LOBs that you don't always need access to inline with other data can be slow.  It reduces row density per data block which means more disk IO is needed to get your data.
  • There are about 42 different ways to query the same data.  Most of them are terrible.

Obviously you should take this information for what it is.  This is just a few things I've come to believe based upon 5 years of banging my head against a wall.  I write software, and somehow I just inherited this database responsibility.  I'm not sure how, I think I was just closer to the server when it died the first time.  

Most of the problems I've had with databases have been hardware related. A good backup strategy goes a long way to keeping your ass out of the fire.  It's a terrible feeling when you lose your own data.  It's an even worse feeling when you lose someone else's data.  If you are an accidental DBA, I feel your pain.  We should all start a support group.

Data Exchange Basics

In the past month I've run into two clients that just don't understand how to take in outside data. I really didn't think that this post was necessary until I had these recent conversations. I thought I would just lay it out there how I feel that data exchange works. It's really quite simple.

First, I should describe my background. I work for a national PPO healthcare network. We directly contract with medical providers and also serve as a PPO aggregator taking in data from several sources. We take data from other networks, wrap it up alongside our own, and then redistribute the whole shebang to clients. Multiple gigabytes of data flow into and out of systems every month.

Taking data from other sources can be difficult. The burden is on the recipient to "scrub" the data and make it fit into your own formats and standards. The only data that you can really trust is your own. There is only so much that you can expect the sender to do as they already have the data in the format that they need. Basically the sender needs to do two things in order to make a successful exchange possible.

  1. Provide a unique record id. This is absolutely critical.  Some clients I've dealt with don't have truly unique record level identifiers.  Sometimes a composite of various fields can be used, which is perfectly acceptable but not exactly ideal.  When there is no identifier, then it becomes really hard to know when data changes.
  2. Provide a consistent and reliable feed of data. The data that is being provided needs to updated regularly and be in a consistent format.  A field that you said was numeric can't magically become alphanumeric without something screwing up. The worst case scenario is when data gets changed in a subtle way that somehow gets misinterpreted on the other end without someone catching it.  When you are updating millions records per month, you can't eyeball very much of it.

That's basically it for the sender.  I told you the burden was on the recipient.  So, you have an outside source of data, now what?

  1. Uniquely identify the source.  When you are looking at a piece of data, you have to know where it came from.  Each source of data into your system (including your own) needs to have a identity.  That identity has to be tagged to the data feed.
  2. Store the source's unique record id beside yours. Before you even start to think about it, no you can't use their identifier as your own.  Furthermore, you shouldn't try to generate some kind of composite key based upon the source and the source's id. You need your own internal id.  Anytime I have a mashup of data, I create an identity (id) for foreign keys and then store the source (source_id) and the source's record identifier (external_id).  External id is always an null-able column with varchar(max).
  3. Reformat the data to fit into your system. Your data standards are not the responsibility of the source.  That bears repeating, your data standards are not the responsibility of the source.  This will mean that you will have to make compromises with the data and sometimes possibly outright reject data.  It's just part of it, but you'd rather have less higher quality data, than a bunch of crappy data.
  4. Notify the source of problems so that they have the opportunity to make corrections. While I don't like it when it happens, clients will approach me from time to time to point out oddities within my own data.  It's going to happen, and it's only because they have a different perspective on the data and may be looking at different things.  More eyeballs on the data means better quality data in the end.
  5. Trust nothing that you are receiving. Whenever the data exchange discussion is initiated, try to get as much information from the source as possible.  What fields are required?  What data types are expected?  Take all of the information you gather and write code to enforce it. The better job you do here goes a long way in catching senders who violate #2 in the list above.
  6. Use the source's naming conventions.  Somewhere you will be doing a mapping from their data to your own.  When you are referring to their data, name it (in code, in a temporary table, wherever) what they call it in their documentation.  This makes debugging stuff later on so much easier.  For the love of all things holy, never ever refer to their data by position within a  file. Whenever their format changes (sometimes it will happen), then you will be up shit creek. I name it what they call it right up to the point that it leaves their data structures and enters my own so that I can explicitly see the mapping.  "my.StateDiscount = their.Amount-their.FsAllowedAmount" goes a long way in bridging communication barriers 10 months down the line when an odd problem shows it's ugly head.

Bonus Tip: I've found that full data exchanges much more user friendly than add/update/delete notifications.  This is from my experience with data that is relatively constant with a lot of random updates, which may not necessarily apply for other type of data (time sensitive data comes to mind real fast here).  With relatively static data, a full exchange provides a few benefits for both sides.  The sender doesn't have to worry about tracking changes.  The recipient doesn't have to worry about processing the data in a certain order.  The latest data is the only data that matters.  It also allows the recipient to easily detect if a record gets orphaned somehow.  With change notifications, there is really a need for an acknowledgement from the recipient so that the sender knows the data was processed and is up to date.  The only downside for the full exchange can be file size, but it's a trade off I find acceptable.

That's about all I have on this subject.  If you have any more tips on exchanging data, I'd love to hear them.  This is a big part of what I'm responsible for every day.  A lot of it is common sense, and a lot of it is just stuff I learned by screwing up.  Dealing with other people's data is never fun.