Recently I got into a particular situation:
We have a process (Process A) which goes through all the data inserted today in one table and does a complex calculation using multiple tables. If a particular condition is not met, it copies the data into a table (ProcessData). This process goes through more than 100K rows but inserts fewer than 1,000 rows into the table.
Then I have another process (Process B), which takes the rows from the ProcessData table one by one and performs complex processing on each.
Both Process A and Process B are Windows services running at off-peak time. They access some large and heavily used tables. (Some tables have more than 30 million rows and get around 160,000 hits per day.)
Now, what is the best way of archiving the data from the ProcessData table?
Suggestion 1: When Process B completes a row, delete the row and move the data into an archive table.
Suggestion 2: Wait until Process B completes all rows, then move all rows to the archive table and truncate the ProcessData table.
Suggestion 3: When Process B completes a row, mark the row (using a flag). Create another scheduled job to move the flagged data periodically.
What is the best method?
Suggestion 1: This seems to be a clean solution, as each row is archived as soon as it is processed. Because it handles only one row at a time, the chance of holding a large number of locks, or escalating to coarser locks, is minimal. However, if the processing and archiving are enclosed in one transaction, the locks on the other heavily used tables will be held for a longer time.
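On SQL Server, the per-row move in suggestion 1 can be done in a single atomic statement with the OUTPUT clause, which avoids a separate INSERT/DELETE pair. A minimal sketch, assuming a hypothetical ProcessData(Id, Payload) table and an archive table with matching columns:

```sql
-- Hypothetical schema: dbo.ProcessData(Id, Payload) and
-- dbo.ProcessDataArchive(Id, Payload). @ProcessedId is the row
-- Process B just finished.
-- One statement deletes the row and copies it to the archive atomically,
-- so the archiving needs no extra transaction scope of its own.
DELETE FROM dbo.ProcessData
OUTPUT DELETED.Id, DELETED.Payload
INTO dbo.ProcessDataArchive (Id, Payload)
WHERE Id = @ProcessedId;
```

Keeping this statement outside the main processing transaction (or in its own short transaction) limits how long locks on the busy tables are held.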
Suggestion 2: First of all, this method adds complexity: there must be a mechanism to mark each row as processed (otherwise, how will you identify the remaining rows if the process terminates halfway?). It will hold slightly fewer locks on the table, since the copying is separated from the main process. Truncate can be used only if there are no unprocessed rows (what if Process A has started again while Process B is still working?). However, moving the archiving to the end has a few definite advantages: it can be removed from the main transaction that drives the main process, and TRUNCATE is minimally logged and takes less log space.
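The end-of-run move in suggestion 2 could look like the sketch below, again assuming hypothetical table and column names plus a Processed flag column. Note the guard before TRUNCATE, since unprocessed rows may exist if Process A has started again:

```sql
-- Hypothetical schema: dbo.ProcessData(Id, Payload, Processed BIT).
-- During the run, Process B marks each row it finishes:
UPDATE dbo.ProcessData SET Processed = 1 WHERE Id = @ProcessedId;

-- At the end of the run, archive everything in one set-based operation.
BEGIN TRANSACTION;

INSERT INTO dbo.ProcessDataArchive (Id, Payload)
SELECT Id, Payload
FROM dbo.ProcessData
WHERE Processed = 1;

IF NOT EXISTS (SELECT 1 FROM dbo.ProcessData WHERE Processed = 0)
    TRUNCATE TABLE dbo.ProcessData;   -- minimally logged, safe: no unprocessed rows
ELSE
    DELETE FROM dbo.ProcessData
    WHERE Processed = 1;              -- fall back to a fully logged delete

COMMIT TRANSACTION;
```

TRUNCATE inside a transaction is fine on SQL Server (it can be rolled back); the fallback DELETE handles the case where Process A has already inserted new, unprocessed rows.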
Suggestion 3: Since all the steps are now separated into individual processes, coordination between them has to be maintained carefully. Additionally, depending on the timing of the scheduled job, the ProcessData table could grow, and reading its rows may become an IO-intensive operation.
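For completeness, the scheduled job in suggestion 3 could move flagged rows in small batches to keep lock durations short; this is a sketch using the same hypothetical schema and flag column as above, with an arbitrary batch size of 500:

```sql
-- Periodic job (e.g. a SQL Server Agent step): move flagged rows
-- in batches so each DELETE holds locks only briefly.
DECLARE @moved INT = 1;

WHILE @moved > 0
BEGIN
    DELETE TOP (500) FROM dbo.ProcessData
    OUTPUT DELETED.Id, DELETED.Payload
    INTO dbo.ProcessDataArchive (Id, Payload)
    WHERE Processed = 1;

    SET @moved = @@ROWCOUNT;  -- stop when no flagged rows remain
END;
```

The batching matters mainly if the job runs infrequently and the table has grown in the meantime.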
If you have any audit operations on update, as many companies do (such as recording last-updated information), the flag updates will generate additional write operations.
I would go for the first or second option, not the third. It does not offer any additional benefit over the second method, and it can cause the ProcessData table to grow and Process B to take more time.
What do you all think?