How to Migrate Oracle Data With Ruby

June 30, 2014

oracle1

Data migrations are the pinnacle of any web development project. It feels like climbing to the top of a mountain just to look back down and jump. On a typical project, you can test everything against the new schema. Throw caution to the wind and make radical changes at a hundred miles per hour. But, what about legacy data that is still in production?

In this article, I’ll delve deep into what it takes to complete a successful data migration in Oracle. Modern web developers have many tools at their disposal. Here, I’ll explain why data migration scripts can be a joy to work with in Ruby. I’ll be dealing with Oracle instead of migrations in Rails.

I will be using OCI8 in my examples, which is an Oracle adapter for Ruby. It allows you write Ruby code to talk to your Oracle database server.

To begin, data migrations involve a database normalization. This Wikipedia article explains:

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using the defined relationships.

I have just completed a project in which my legacy database was full of duplicate data. This led to data quality issues and some strange code. A bloated and ugly database guarantees code that is full of wrong and weird hacks. In short, my project was in dire need of a beauty salon.

Setup

Let’s define a couple of tables:

CREATE TABLE answer (
  answer_id NUMBER PRIMARY KEY,
  question_id NUMBER,
  answer_sequence NUMBER,
  answer_type VARCHAR2(7),
  answer_text VARCHAR2(31),
  updt_date DATE
  );
CREATE TABLE question (
  question_id NUMBER PRIMARY KEY,
  question_text VARCHAR2(31),
  placement NUMBER,
  updt_date DATE
  );

Looking at this example, we are dealing with your basic questionnaire. Every answer belongs to a question and answers have a specific type. Answers can either be radioboxes or checkboxes depending on the question.

This is what our data looks like.

First question:

INSERT INTO question (
  question_id, -- why not just id?
  question_text, -- question gets implied
  placement, -- good, I like it!
  updt_date -- where is create date?
  )
VALUES (
  1,
  'A question',
  1,
  sysdate
  );
INSERT INTO answer (
  answer_id,
  question_id,
  answer_sequence, -- how about placement?
  answer_type,
  answer_text,
  updt_date
  )
VALUES (
  1,
  1,
  1,
  'MLT', -- duplicate
  'First answer',
  sysdate
  );
INSERT INTO answer (
  answer_id,
  question_id,
  answer_sequence,
  answer_type,
  answer_text,
  updt_date
  )
VALUES (
  2,
  1,
  2,
  'MLT', -- duplicate
  'Second answer',
  sysdate
  );
INSERT INTO answer (
  answer_id,
  question_id,
  answer_sequence,
  answer_type,
  answer_text,
  updt_date
  )
VALUES (
  3,
  1,
  3,
  'MULTI', -- what? that’s, messed up!
  'Third answer',
  sysdate
  );

Second question:

INSERT INTO question (
  question_id,
  question_text,
  placement,
  updt_date
  )
VALUES (
  2,
  'Another question',
  2,
  sysdate
  );
INSERT INTO answer (
  answer_id,
  question_id,
  answer_sequence,
  answer_type,
  answer_text,
  updt_date
  )
VALUES (
  4,
  2,
  1,
  'CHK', -- duplicate
  'First answer',
  sysdate
  );
INSERT INTO answer (
  answer_id,
  question_id,
  answer_sequence,
  answer_type,
  answer_text,
  updt_date
  )
VALUES (
  5,
  2,
  2,
  'CHK', -- duplicate
  'Second answer',
  sysdate
  );

Let’s see that data:

SQL> SELECT answer_id, question_id, answer_sequence, answer_type, updt_date FROM answer;

 ANSWER_ID QUESTION_ID ANSWER_SEQUENCE ANSWER_ UPDT_DATE
---------- ----------- --------------- ------- ---------
         1           1               1 MLT     23-JUN-14
         2           1               2 MLT     23-JUN-14
         3           1               3 MULTI   23-JUN-14
         4           2               1 CHK     23-JUN-14
         5           2               2 CHK     23-JUN-14

SQL> SELECT question_id, placement, updt_date FROM question;

QUESTION_ID  PLACEMENT UPDT_DATE
----------- ---------- ---------
          1          1 23-JUN-14
          2          2 23-JUN-14

Yuck! I have data duplication, inconsistent and unimaginative column names and half baked time stamps. As a direct consequence, my data now has nasty data quality issues. Oh boy.

We can do better:

CREATE TABLE answers (
  id NUMBER CONSTRAINT answers_pk PRIMARY KEY,
  question_id NUMBER,
  placement NUMBER,
  text VARCHAR2(31),
  created_at DATE,
  updated_at DATE
  );
CREATE TABLE questions (
  id NUMBER CONSTRAINT questions_pk PRIMARY KEY,
  type_id NUMBER,
  placement NUMBER,
  text VARCHAR2(31),
  created_at DATE,
  updated_at DATE
  );
CREATE TABLE question_types (
  id NUMBER CONSTRAINT question_types_pk PRIMARY KEY,
  name VARCHAR(7),
  created_at DATE,
  updated_at DATE
  );

My data model here is heavily influenced by Rails. Answer types get ripped completely out of the answer table and placed in question. Instead of putting repeated question type names in the table, I have placed it in a separate table and added a relationship. Column names are simple and intuitive.

Pitfalls

Presumably you have working code against the new schema, along with passing tests. Let’s now turn our attention towards the production data.

Be sure to install the gem:

gem install ruby-oci8

Let’s start with this simple migration script:

require 'oci8'

src = OCI8.new('username/password@schema')
tgt = OCI8.new('username/password@schema')
tgt.autocommit = false
src.exec("SELECT question_id,
            (
              SELECT answer.answer_type
              FROM answer
              WHERE answer.question_id = question.question_id
                AND ROWNUM = 1
            ) question_type,
            question_text,
            placement,
            updt_date
          FROM question") do |qr|
  src.exec("SELECT answer_id,
              answer_sequence,
              answer_text,
              updt_date
            FROM answer
            WHERE question_id = :1", qr[0].to_i) do |ar|
    puts ar
  end
end
src.logoff
tgt.logoff

OCI8 is not intuitive in its error messages, so I recommend incremental steps. Here, I am querying the src data to make sure everything works. Notice how I can pass in parameters to the exec method using basic Oracle. I convert types to Ruby types to make sure nothing gets misinterpreted in OCI8. The tgt.autocommit = false lets me test out my migration code until I’m ready to commit changes. The .logoff invocation gracefully kills the connection to Oracle. Since I am attaching answer types to questions, I must sub query it.

More Examples

With this sound foundation in place, let’s start populating the questions table. Assuming we are inside the qr callback:

QUESTION_TYPE_IDS = { "CHK" => 1, "MLT" => 2 }
tgt.exec("SELECT questions_seq.nextval FROM dual") do |qid|
  tgt.exec("INSERT INTO questions (
              id,
              type_id,
              placement,
              text,
              created_at,
              updated_at
              )
            VALUES (
              :1,
              :2,
              :3,
              :4,
              :5,
              :6
              )",
    qid[0].to_i,
    QUESTION_TYPE_IDS[qr[1]].to_i,
    qr[3].to_i,
    qr[2].to_s[0..30],
    qr[4].to_date,
    qr[4].to_date)
  end
end

I am using a simple hash called QUESTION_TYPE_IDS to transform question types into ids. Oracle enforces sequences for primary keys, which forces me to place the INSERT code inside the qid callback. In my hashes, I am type casting to make sure no nil values creep into the insert. OCI8 will throw nasty and incomprehensible errors if nothing matches the hash.

For the question text, the VARCHAR2 types has a max length we don’t want to exceed. To do this, just convert to a string and do [0..30] to truncate the value. This works even if the string is shorter than the limit.

To finish, let’s focus our attention towards the answers table. This block of code goes inside the ar callback.

tgt.exec("SELECT answers_seq.nextval FROM dual") do |aid|
  tgt.exec("INSERT INTO answers (
              id,
              question_id,
              placement,
              text,
              created_at,
              updated_at
              )
            VALUES (
              :1,
              :2,
              :3,
              :4,
              :5,
              :6
              )",
    aid[0].to_i,
    qid[0].to_i,
    ar[1].to_i,
    ar[2].to_s[0..30],
    ar[3].to_date,
    ar[3].to_date)
end

This is much of what we saw in the questions migration. Here, we are mapping the question_id from qid from the questions block. This way, my data migration respects the relationships between objects.

With all this out the way, flip the switch and watch the magic unfold.

tgt.autocommit = true

Now for the most difficult feat yet: sit back, throw your hands behind your head, and let it do its thing. For my particular project, this migration took over 15 minutes. It felt like I was plunging into the depths of the abyss, wroght with unspeakable anxiety, time ceasing to exist. But, it completed successfully and I landed safely on the other side.

Wrapping Up

Time to check our final answers and questions tables and see how everything looks. I’ll omit VARCHAR2 fields for simplicity. I’m using sqlplus in Oracle:

SQL> SELECT id, question_id, placement, created_at, updated_at FROM answers;

        ID QUESTION_ID  PLACEMENT CREATED_A UPDATED_A
---------- ----------- ---------- --------- ---------
         1           1          1 10-JUN-14 10-JUN-14
         2           1          2 10-JUN-14 10-JUN-14
         3           1          3 10-JUN-14 10-JUN-14
         4           2          1 10-JUN-14 10-JUN-14
         5           2          2 10-JUN-14 10-JUN-14

SQL> SELECT id, type_id, placement, created_at, updated_at FROM questions;

        ID    TYPE_ID  PLACEMENT CREATED_A UPDATED_A
---------- ---------- ---------- --------- ---------
         1          2          1 10-JUN-14 10-JUN-14
         2          1          2 10-JUN-14 10-JUN-14

Beautiful.

If interested, you may download all code samples from GitHub.

Happy Hacking!