Sieve of Eratosthenes in PL/SQL

Yesterday I posted an Oracle SQL query that would produce primes, but it wasn't an implementation of the Sieve of Eratosthenes. Here, for your viewing discomfort, is a version in Oracle PL/SQL.

create or replace package josh.sieve as
   type prime_numbers is table of number;
   function calculate ( the_end number ) return prime_numbers pipelined;
end;
/


create or replace package body josh.sieve as
   function calculate ( the_end number ) return prime_numbers pipelined is
      type bitbucket is table of boolean index by binary_integer;
      primes bitbucket;
      max_sqrt number := floor(sqrt(the_end));
      i number := 3;
      j number;
   begin
      pipe row(2); -- the only even prime

      -- assume everything is prime until proven otherwise
      for k in 1 .. the_end loop
         primes(k) := true;
      end loop;

      -- walk the odd numbers, crossing off multiples as we go
      while i <= the_end loop
         if primes(i) then
            if i <= max_sqrt then
               j := i * i; -- smaller multiples were already crossed off
               while j <= the_end loop
                  primes(j) := false;
                  j := j + 2 * i; -- skip the even multiples
               end loop;
            end if;

            pipe row(i);
         end if;

         i := i + 2;
      end loop;

      return; -- a pipelined function still needs a bare return
   end;
end;
/

This is the ugliest one so far. It really burns my eyes!

What I liked
I was kind of pleased to see the pipe row(foo); functionality. This is very similar to yield in C#, where I can just lazily return numbers as they are found instead of building up the whole table in advance.
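To illustrate the comparison, here's roughly what the same lazy-yield pattern looks like as a Python generator (my own sketch, not part of the original PL/SQL):

```python
def sieve(the_end):
    """Lazily yield primes up to the_end, like pipe row does."""
    if the_end < 2:
        return
    yield 2  # the only even prime
    primes = [True] * (the_end + 1)
    i = 3
    while i <= the_end:
        if primes[i]:
            if i * i <= the_end:
                # cross off odd multiples, starting at i*i
                for j in range(i * i, the_end + 1, 2 * i):
                    primes[j] = False
            yield i  # hand back the prime as soon as it's known
        i += 2

print(list(sieve(30)))  # -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

A caller can consume the generator one prime at a time, just like selecting from the pipelined function.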

What I hated
There's a bunch here. First off, I tried to write this with a for loop, but there's no way to define the step value, so I had to resort to a while loop with an external counter variable. It also sucks that I have to define a custom type that's just a table of numbers. Finally, I didn't see a clean way to initialize a table with a size and a default value.

Why it burns my eyes
I'm not digging that assignments have to use a colon-equal "foo := 42" operator. In the same vein as BASIC, I hate that blocks close with an end keyword. This is an especially big WTF when you consider that lines are terminated with semicolons. The end result is a bunch of colon-ish punctuation littered with end keywords.

Regardless, I was pleased with the performance. A query like select count(1) from table(josh.sieve.calculate(1000000)); runs in about 600 milliseconds on our db machine. Not too shabby...

Prime Numbers as a SQL Query

This isn't a Sieve of Eratosthenes algorithm, but it's too good not to share.

select r 
from (
  select rownum+1 r from dual connect by rownum < 10000
)
where
  r=2
  or 0 not in(
    select mod(r,rownum+1) from dual
    connect by rownum<sqrt(r)
  )

This is an Oracle query that will generate all of the primes up to 10,000. The query is not even close to fast, but I think it's kind of funny. It's just your typical everyday brute force method of calculating prime numbers.
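For the curious, the same brute-force trial division can be sketched in Python (my translation, not from the original post):

```python
def primes_up_to(limit):
    """Brute-force trial division, like the SQL query above:
    r is prime if no d in 2..sqrt(r) divides it evenly."""
    result = []
    for r in range(2, limit + 1):
        if r == 2 or all(r % d != 0 for d in range(2, int(r ** 0.5) + 1)):
            result.append(r)
    return result

print(primes_up_to(50))  # -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
```

Like the SQL version, it re-tests every candidate from scratch, which is exactly why neither is even close to fast.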

More Radius Search Fun

In my previous post, I outlined the process I'm using to speed up radius searches over a large pool of potential matches. With that, I was able to answer the question "How many are within n miles?" More recently I've been asked to answer the question "What is the minimum radius that contains n records?"

Example Radius Plot


So, let's say that the red dot represents 2 records and the green dot represents 3 records. If I need to find the minimum radius to contain 4 records, then I'll need to go out to the green dot to satisfy that requirement. I think it's time I introduced my friendly query:

WITH available_records_by_zip as (
      select zip, count(1) locations
      from records
      inner join addresses a using(address_id)      
      group by zip
)
SELECT origin_zip, radius
  FROM (
    SELECT 
      origin_zip, 
      radius, 
      cnt,
      row_number() over(PARTITION BY origin_zip ORDER BY radius) rnk  
    FROM (
      SELECT
        zr.zip_from origin_zip,
        radius,
        SUM(available_records_by_zip.locations) over( --This is the analytic function
          PARTITION BY  zr.zip_from                    --which makes this possible.
          ORDER BY radius
        ) cnt
      FROM        
      zip_radius_map zr,
      available_records_by_zip
      where zr.zip_from=:search_zip
      and zr.zip_to= available_records_by_zip.zip
    )
    WHERE cnt >= 4 -- Number of locations needed
  )
WHERE rnk = 1

This is what it does: given a pool of records, group them by origin zip and order them by distance. The analytic SUM adds each zip's count to all of the counts before it, so you only pull the rows where that running total meets the minimum requirement. Order that set by radius and take the first one. That row is your minimum radius.
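Stripped of the SQL, the logic reduces to a running sum over radius-ordered rows. Here's a Python sketch of that idea (my own simplification; the radii and counts are made-up example values):

```python
def min_radius(rows, needed):
    """rows: (radius, locations) pairs for one origin zip.
    Return the smallest radius whose running total of locations
    reaches `needed` -- the same job as SUM(...) OVER (ORDER BY radius)."""
    total = 0
    for radius, locations in sorted(rows):
        total += locations
        if total >= needed:
            return radius
    return None  # the whole pool is too small

# red dot: 2 records at radius 5; green dot: 3 records at radius 12
print(min_radius([(5, 2), (12, 3)], 4))  # -> 12
```

The database does this once per origin zip via the PARTITION BY, which is what makes a single pass over thousands of zips cheap.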

In the real world, I'm joining to a table of target zips, but you get the idea. Doing it this way (cheating), I can do a single analysis for a few thousand target zips in a handful of seconds. The time it takes really just depends on the size of your set of available providers. These Oracle analytic functions are very powerful once you understand them.

What I’ve Learned as an Accidental DBA

Working for a small company has its advantages and disadvantages.  One of the biggest disadvantages has to be the lack of specialist employees.  I wear many hats at work.  Most days I'm a software developer.  Some days I get to be a network administrator.  When a server is down, I'm the systems administrator.  One day I was the janitor.  Since I'm big and tall, I'm frequently mistaken for a furniture mover.

It's a blessing and a curse.  There is never a dull moment and always something for me to do.  I've acquired many skills out of necessity.  One of those skills has been database administration.  More specifically, Oracle administration.  I've learned to hate the database server.  When something goes wrong in that box of mystery, everything grinds to a halt and people start getting pissed.  So, in no particular order, I'd like to share some nuggets of information I've learned over the years.

  • Cheap RAID cards are a bitch.  I've had the opportunity to witness the carnage that can happen when good RAID goes bad.
  • A fast disk is a fast database.  Unless you can keep your entire database in memory, your data is going to need to be read from disk.  Disk IO is a major bottleneck.
  • RAID is awesome when it works.  What's faster than one fast disk?  More fast disks and a fast RAID controller.  More is better. Just make sure you invest in quality equipment.  Good RAID controllers aren't cheap for a reason.  The expensive ones don't corrupt your data.
  • RAID 10 is better than RAID 5.  The only exception is if you never need to write data with any speed.  Disks are so cheap that it doesn't make any sense to try to save the money.  RAID 10 is just that much faster.
  • Indexes are a double-edged sword.  They are necessary for fast queries, but too many will slow inserts and updates.  Also, indexes on large tables should be chosen carefully, as they can consume quite a bit of disk.
  • A bad query can bring the database to its knees.
  • Locks protect multiple requests from stomping on top of each other.  They are also good at stopping inserts and updates from happening when one gets stuck.
  • Keeping the statistics up to date makes sure that the database chooses an optimal execution plan for retrieving your data.  The database needs to know what your data looks like.
  • Full table scans aren't always bad, but most of the time they are.  The only time they aren't is generally when you are querying a majority of the data in the table.
  • Be careful where you store your LOBs. Storing LOBs that you don't always need access to inline with other data can be slow.  It reduces row density per data block which means more disk IO is needed to get your data.
  • There are about 42 different ways to query the same data.  Most of them are terrible.

Obviously you should take this information for what it is.  These are just a few things I've come to believe based upon 5 years of banging my head against a wall.  I write software, and somehow I just inherited this database responsibility.  I'm not sure how; I think I was just closer to the server when it died the first time.

Most of the problems I've had with databases have been hardware related. A good backup strategy goes a long way to keeping your ass out of the fire.  It's a terrible feeling when you lose your own data.  It's an even worse feeling when you lose someone else's data.  If you are an accidental DBA, I feel your pain.  We should all start a support group.

Code Dump: NPI PL/SQL Validation

I just want to throw this out there since it's useful. I searched and didn't come up with an Oracle-specific implementation of this algorithm, so here it is. This function checks if a given National Provider Identifier (NPI) is valid. The Luhn algorithm is used to validate an NPI after a prefix of "80840" is applied.

This code is provided as is. I'm not making any guarantees that it's 100% right and I'm not responsible if anything bad happens as a result of your use of this. So, don't yell at me if small animals are harmed with the use of this code.

--Returns 1 if valid, 0 if invalid 
CREATE OR REPLACE FUNCTION valid_npi (npi in varchar2)
RETURN number
IS  
    tot NUMBER := 24; --80840 prefix for npi
    val NUMBER := 0;
BEGIN  
  if( 
    npi is null or 
    length(npi) <> 10 or 
    LENGTH(TRIM(TRANSLATE(npi, '0123456789', ' ')))>0 --not numeric
  ) then
    return 0;
  end if;
  
  for i IN reverse 1 .. LENGTH(npi) loop
    val:=SUBSTR(npi, i, 1);
    if mod(i,2)=1 then
      val := val * 2;      
      if(val > 9) then
        val:=val-9;
      end if;
    end if;
    
    tot := tot + val;    
  end loop;

  if(MOD(tot, 10)=0) then
    return 1;
  else
    return 0;
  end if;  
END valid_npi;
/
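For comparison, here's the same check sketched in Python (my own translation of the logic above, not part of the original; the 24 seed is the Luhn contribution of the fixed "80840" prefix):

```python
def valid_npi(npi):
    """Return True if npi passes the Luhn check with the 80840 prefix."""
    if npi is None or len(npi) != 10 or not npi.isdigit():
        return False
    total = 24  # what the '80840' prefix digits contribute to the Luhn sum
    for i, ch in enumerate(npi, start=1):
        val = int(ch)
        if i % 2 == 1:  # odd positions (from the left) get doubled
            val *= 2
            if val > 9:
                val -= 9
        total += val
    return total % 10 == 0

print(valid_npi("1234567893"))  # -> True (check digit computed for this example)
```

Same usual disclaimer applies: it's a sketch of the algorithm, so test it against known-good NPIs before trusting it with anything real.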
