Monday, September 28, 2009
We were trying new stuff when we got a chance to work on a new BCU with 48 partitions, and awesome memory (8 GB per node), and tried using the MDC with all partitioning capabilities, in hashing and range. One of my colleagues used a variable high on cardinality and high on quantiles and no frequent values (DB2 terms). The load took more than 1 hour for a million rows, and it grew tremendously in size. Thus we started reading each and every line of the redbook www.redbooks.ibm.com/redbooks/pdfs/sg247467.pdf
In my case, I used the year(order_thro_dt) and client_cd . So
Client_cd = 300 Client_cd = 301 Client-cd – 302 Client-cd - 303
Client_cd = 300
Client_cd = 301
Client-cd – 302
Client-cd - 303
The cube gives you an ex if you use year and some frequently used value as dimentions,
Imagine using ssn, the full 1st row in the violet will be for one ssn and one client, 2nd for one ssn and one client. Such a waste and such slow inserts since db2 has to ‘allocate the block when inserting every new value for the dimention’, so one per ssn – 123456789 and one when ssn = 234567890.
Now what happens during deletes is, db2 drops the whole block only when last value is deleted, for example , if we delete all instances of 300, the first row slice will go off(a), and if we delete all the year-2004 values, the 2nd column slice (b) will go off.
So the inserts of a new year like 2009 will make a case for new column dimention block to get created vis a vis new client for the row dimention block.
MDC and reorgs :
The reason why in case of MDC db2 would not need reorging is because the table is organized by dimensions, which are blocks, there will be only one pointer to the block, and not to every row of the table in case of normal indices.
Thus new blocks are added only when the existing ones get filled, for ex 300 has more than one extent of freq values, the 2nd block gets created. In this case, the 2nd and new block will have only one value and rest would all be empty. Thus the waste in space. So extent size or block size is very important in MDC since smaller blocks will make db2 create new ones, link with previous and take up higher cost, and larger ones waste space as given in the above example.
SQLs that can benefit from MDC :
Blocked indices could be thought of as a floor full of a class of people, one having only American citizens , one with Indian citizens and so on. The first row in the room will be filled with say people above 50 years, 2nd with people from 40-50 years , 3rd with people from 30-40 years.
Now if we want to know how many ‘Americans’ are above 50 years old, we just go to the top floor, and count everyone in the 1st row, end of story, and is the same case if we need count of just Indians, just go to the floor below the top and count the entire one.
So in our case, we have loads of sqls with predicates belonging to clients and the year of service,
Select sum(a), count(B) from table1 where client = ? and year_of_service= ?
This is a part of a typical OLAP query, and hence the assumption that mdc helps analysis queries and are not so helpful in transaction dbs where loading and deleting happens every day.
Now if we need to find someone by the name ‘thomas’, he could be an Indian, an American, a Chinese, and so db2 has to go to each floor and ask each guy if his name was Thomas – a very costly operation.
Select sum(a) from table1 where ssn = ?
For helping these kinds of queries, db2 allows normal indices along with MDC specification in a table.
Create index index1 on table1(ssn);
That will have pointers to each row with ssn value in the table. If it is unique, you can even create unique index, piggy back another non-unique column for ‘index only access’.
The above link to the redbook from IBM that gives you in depth review about MDC, when can they be considered, and the space calculation formulae. This entry was supposed to be a plain English introduction to this beautiful feature that can only be challenged by Oracle’s bit mapped index, but even then, DB2’s implementation wins hands down. The following link tells something about the deferred MDC rollout
This post was not aimed to be an indepth analysis or a strict tech post. So am not discussing about the mathematics of MDC considerations or size calculations. Please take a look at the DEFER value in the ’DB2_MDC_ROLLOUT’ variable and using it in SET CURRENT MDC ROLLOUT MODE statement.
Posted by Arun Srini at 3:18 AM