Apr 13, 2008

Handling BIG Database Table "UserActivity"

Let's start with the stats -

We have over 70 million records in the user_activity table now. As the name implies, user_activity keeps a record of users' actions in our applications.

The growth during 2007 looked like this (in millions) -

1st quarter: 6
2nd quarter: 8.5
3rd quarter: 10.5
4th quarter: 12

This huge number of records was giving us a lot more pain than expected. The SharePlex tool cannot sync a table this big between databases. The table occupies one third of the whole database. Handling dumps requires extra time and space. We could go for Oracle partitioning, but it would not serve our purpose here.

I thought we should handle it ourselves. I decided to split the user_activity table into a number of tables according to date range. For example, all 2004 data would reside in one table, 2005 data would go to another table, and the same for 2006 data as well. I am keeping these years' data together because we had a smaller amount of data back then. From 2007 onwards, I split the data quarterly. Again, this is not fixed; in 2009, when we would have 100K system users, we might need to split monthly :)

Now, the user_activity table is divided into a number of tables like -

UA_200401_200412
UA_200501_200512
UA_200601_200612
UA_200701_200703
UA_200704_200706
UA_200707_200709
UA_200710_200712
USER_ACTIVITY

After each quarter of 2008, I will run the dynamic partitioning script, which will split user_activity again and create a new table. For example, the first quarter's table will be UA_200801_200803.
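A rough sketch of what that split could look like is below - the column name activity_date and the exact cut-off dates are my assumptions, and the real script also has to take care of indexes, grants and the lookup table described next -

-- Sketch only: carve the finished quarter out of the live table.
-- activity_date is an assumed column name; adjust to the real schema.
CREATE TABLE ua_200801_200803 AS
  SELECT *
    FROM user_activity
   WHERE activity_date >= TO_DATE('01-01-2008','mm-dd-yyyy')
     AND activity_date <  TO_DATE('04-01-2008','mm-dd-yyyy');

-- Remove the archived rows from the live table.
DELETE FROM user_activity
 WHERE activity_date >= TO_DATE('01-01-2008','mm-dd-yyyy')
   AND activity_date <  TO_DATE('04-01-2008','mm-dd-yyyy');

COMMIT;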

This brings some programming challenges. What if a user selects a date range that requires fetching data from multiple tables, and how do we formulate clean logic for that?

Okay, we need some generic mechanism so that the code remains clean and scalable. I created a table USER_ACTIVITY_SPLIT_INFO which tells us, after breaking up the original table, which table(s) the user activity data for a given period has to be looked up in. This table contains information like the following -

TABLE_NAME          BEGIN_DATE                      END_DATE
------------------  ------------------------------  ------------------------------
UA_200401_200412    25-OCT-03 02.23.22.000000 PM    31-DEC-04 03.50.56.000000 PM
UA_200501_200512    01-JAN-05 05.09.38.000000 AM    31-DEC-05 11.58.52.000000 PM
UA_200601_200612    01-JAN-06 12.01.23.000000 AM    31-DEC-06 11.59.55.956000 PM
UA_200701_200703    01-JAN-07 12.00.27.530000 AM    31-MAR-07 11.59.59.344000 PM
UA_200704_200706    01-APR-07 12.00.00.786000 AM    30-JUN-07 11.59.58.917000 PM
UA_200707_200709    01-JUL-07 12.00.00.055000 AM    30-SEP-07 11.59.57.929000 PM
UA_200710_200712    01-OCT-07 12.00.00.811000 AM    31-DEC-07 11.59.57.893000 PM
USER_ACTIVITY       01-JAN-08 12.00.02.140000 AM    (null)
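For reference, a minimal definition of this lookup table could look like the following (the column types are my guesses based on the values above) -

-- Sketch of the lookup table; the real column types may differ.
CREATE TABLE user_activity_split_info (
  table_name  VARCHAR2(30)  NOT NULL,  -- table that holds this date range
  begin_date  TIMESTAMP     NOT NULL,  -- earliest activity timestamp in that table
  end_date    TIMESTAMP                -- latest activity timestamp; NULL for the live table
);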

Now, for example, suppose a user is seeking data for the date range 11-01-2007 to 04-01-2008. The following query returns the name of the table(s) whose data overlaps that range - that is, any table that begins on or before the end of the range and ends on or after its start (or has no end date yet, like the live table) -

SELECT x.table_name
FROM user_activity_split_info x
WHERE x.begin_date <= TO_DATE('04-01-2008','mm-dd-yyyy')
AND (x.end_date IS NULL OR x.end_date >= TO_DATE('11-01-2007','mm-dd-yyyy'));

Output:

TABLE_NAME
----------------
UA_200710_200712
USER_ACTIVITY

So it just returned two table names where the data would be found.

In our application code, a method returns the name of the table(s) based on the user's date input; we then fetch data directly from those tables and combine the results together. The programmer does not have to know how many tables they are dealing with!
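For example, for the 11-01-2007 to 04-01-2008 range above, the code would effectively run something like the query below - one branch per table name returned by the lookup query, glued together with UNION ALL (the column name activity_date is my assumption about the schema) -

-- Illustrative sketch of the combined query the application builds.
SELECT * FROM ua_200710_200712
 WHERE activity_date BETWEEN TO_DATE('11-01-2007','mm-dd-yyyy')
                         AND TO_DATE('04-01-2008','mm-dd-yyyy')
UNION ALL
SELECT * FROM user_activity
 WHERE activity_date BETWEEN TO_DATE('11-01-2007','mm-dd-yyyy')
                         AND TO_DATE('04-01-2008','mm-dd-yyyy');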

One more thing: we no longer have to worry about the sequence number crossing its limit after a certain number of years. Now we can reset the user_activity sequence at any point with a single command.
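For example, if a short maintenance window is acceptable, the reset can be done by simply dropping and recreating the sequence (user_activity_seq is my guess at the sequence name) -

-- Assumption: the sequence is named user_activity_seq; adjust to the real name.
DROP SEQUENCE user_activity_seq;
CREATE SEQUENCE user_activity_seq START WITH 1 INCREMENT BY 1 CACHE 100;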

I also came up with another approach which would have allowed even simpler programming - almost nothing to change in the code - but that approach requires duplicating user_activity data by overlapping each three months of data! That is not feasible for a table like user_activity - but it was an option.

Let's see how things work!
