Thursday, April 22, 2010

Architechture-Netezza

17 comments:

  1. Netezza Architecture:
    The Netezza Performance Server (NPS®) system’s architecture (Netezza, 2006), depicted in
    Figure 1, is a two-tiered system designed to handle very large queries from multiple users. The
    first tier is a high-performance Linux symmetric multiprocessing host. The host compiles queries
    received from business information (BI) applications, and generates query execution plans. It
    then divides a query into a sequence of sub-tasks, or snippets, which can be executed in parallel,
    and distributes the snippets to the second tier for execution. The host returns the final results to
    the requesting application thus providing the programming advantages of appearing to be a
    traditional database server.

    The second tier consists of dozens to hundreds or thousands of Snippet Processing Units (SPUs)
    operating in parallel. Each SPU is an intelligent query processing and storage node, and consists
    of a powerful commodity processor, dedicated memory, a disk drive and a field-programmable
    disk controller with hard-wired logic to manage data flows and process queries at the disk level,
    as depicted in Figure 2. The massively parallel, shared-nothing SPU blades provide the
    performance advantages of massively parallel processors.

    ReplyDelete
  2. SELECT VIEWNAME,OWNER,CREATEDATE,DEFINITION FROM _V_VIEW WHERE OBJTYPE='VIEW';

    ReplyDelete
  3. Query to get a list of tables in a database:
    SELECT TABLENAME,OWNER,CREATEDATE FROM _V_TABLE WHERE OBJTYPE='TABLE';

    ReplyDelete
  4. Query to get a list of columns from a table or a view:
    SELECT ATTNUM,ATTNAME FROM _V_RELATION_COLUMN WHERE NAME=UPPER('TABLE NAME') ORDER BY ATTNUM ASC;
    Query to get list of user groups on the box:
    SELECT GROUPNAME,OWNER,CREATEDATE,ROWLIMIT,SESSIONTIMEOUT, QUERYTIMEOUT,DEF_PRIORITY,MAX_PRIORITY FROM _V_GROUP;
    Query to get list of users and the groups they are in, on the box:
    SELECT GROUPNAME,OWNER,USERNAME FROM _V_GROUPUSERS;
    (Does not give any LDAP users in this query)
    Query to find the number of rows in a table without actually querying the table:
    (Sometimes needed for some really huge tables of rowcount > 80 Billion)
    SELECT RELNAME TABLE_NAME,
    CASE
    WHEN RELTUPLES < 0
    THEN ((2^32) * RELREFS) + ((2^32) + RELTUPLES )
    ELSE ((2^32) * RELREFS) + ( RELTUPLES )
    END NUM_ROWS FROM _T_CLASS, _T_OBJECT WHERE
    _T_OBJECT.OBJID=_T_CLASS.OID AND _T_OBJECT.OBJCLASS=4905 -- DISPLAY ONLY TABLES
    AND RELNAME = UPPER('TABLE NAM>');
    Query to check if any of the SPU's are running slower than the rest:
    (This actually gives the read-write speed of each SPU that is online)
    SELECT HWID, BYTE_COUNT/TOTAL_MSEC FROM _VT_DISK_TIMING
    ORDER BY 2;
    One more query.. To get the list of tables and thier skew and size:
    SELECT TABLENAME,OBJTYPE,OWNER,CREATEDATE,USED_BYTES,SKEW
    FROM _V_TABLE_ONLY_STORAGE_STAT WHERE OBJCLASS = 4905 OR OBJCLASS = 4911
    ORDER BY TABLENAME;

    ReplyDelete
  5. Get the number of active queries running:
    select * from _v_session where status='active'

    ReplyDelete
  6. ZONE MAP:
    --------
    To clarify, Zone maps are not like indexes. They are like partitions. If you order the data that is being loaded into a table on a particular column, zone maps get created on that column as long as it is not a varchar column. These zone maps help in filtering out the data faster as netezza has a way to know where all that particular data is stored in each spu for that zone map. Distribution and zone maps are two different things. Each have their own way of working. They are not related to each other. A table distributed on random will also have zone maps as long as you order the data that is loaded into it. For example: If you know that the data from a fact table is always queried on a date criteria, the query performance would be a lot faster if you make sure that the data that is loaded into the table is ordered by date during loading. This would automatically create a zonemap on date column and when you query is with a condition on date, netezza would know exactly which sector of the harddrive in each of the spu's that data exists.
    Distribution on the other hand is to allow joins between two huge tables to be seamless because, tables distributed on the same column will end up with records from the same key on the same spu so that when joining those two tables over that distributed column, netezza gets the SPU's to process the data at the SPU level before sending them to the host for delivery. This would on the average have a 10 x better performance than a query against tables with different distributions. One should be careful when distributing a table on a column because if the count of records per every unique value of the column is not uniform to the kind of queries that are going to be happening, then you end up making only one spu work all the time instead of getting all the spu's to work to get your result out which would be faster. In essence, if you are having a fact table which you query on date, and you end up distributing that table on date, for a given date, all its records will be residing on the same spu and only that spu which has that date data will be working for a query against that date. Hope this clarifies your question.

    ReplyDelete
  7. Limit the result:
    select ename from emp limit 5;

    Create table using other table
    craete table emp_bck as
    select * from emp
    where 1=2;

    ReplyDelete
  8. Gather Statistics:
    ------------------
    You can go the the NZADMIN tool and right click on the table and look at the statistics option in the menu. It lists out all the columns and their statistics. If that says that the statistics are out of date or if they are unavailable, that means that you loaded data after statistics generation or you never gathered statistics on that table respectively. If it says 'Current' that means that the statistics on that column are current.

    ReplyDelete
  9. Get Session and Drop session:
    -----------------------------
    select * from _v_session where status='active'

    DROP SESSION %sessionid%
    /*SQL statement for dropping a user session. This way no access to NPS host is required*/

    ReplyDelete
  10. Truncate table:
    begin
    execute Immediate 'TRUNCATE TABLE emp';
    end;

    ReplyDelete
  11. Loop Statement:
    ---------------
    The LOOP statement defines an unconditional loop that repeats until terminated by an
    EXIT statement or a RETURN statement

    LOOP

    END LOOP;
    EXIT Statement

    LOOP
    -- some computations
    IF count > 0 THEN
    EXIT; -- exit loop
    END IF;
    END LOOP;

    WHILE Statement
    With the WHILE statement, you can loop through a sequence of statements as long as the
    evaluation of the condition expression is true.

    WHILE amount_owed > 0 AND gift_certificate_balance > 0 LOOP
    -- some computations here
    END LOOP;

    FOR Statement

    Using the FOR statement, you can create a loop that iterates over a range of integer values.


    FOR i IN 1..10 LOOP
    -- some expressions here
    RAISE NOTICE 'i is %',i;
    END LOOP;

    ReplyDelete
  12. Materialized View vs View:
    --------------------------
    Mviews are a lot different from normal views. if you create a Mview on a table, its like a thin layer of your table physically stored(where as a view is logical but not physical). For example a table has got 50 columns and you are regularly using 5 columns of them, you can create a Mview on these 5 columns which will be used instead of the main table whenever there is a query fired on any/all of these 5 columns. You need not have to select the Mview, instead optimizer will automatically consider the Mview and will use it. And one more main diff is tat Mview is like a thin layer of a table so you cannot use joins to create a Mview.

    ReplyDelete
  13. data dictionary tables:
    There are system tables in Netezza which hold details of database structure etc, and also there are ways of storing query history.. You can use the nzadmin tool to examine the system tables and views, and read chapter 13 of the system admin guide to find out about query history.

    ReplyDelete
  14. select procedure,proceduresource from _V_PROCEDURE -- will give source code of the proc

    ReplyDelete
  15. Factrel Planner
    ----------------
    Factrel planner is something that Netezza uses to determine which table to broadcast when querying table not distributed on the same column. By default the factrel planner is set to 100k in Netezza. You can reduce or increase it globally or for a particular session. Netezza determines the plan that it needs to use by determining which of the tables is a fact and which is a dimension and tries to broadcast the dimension to all the SPU's so that it can get the join between the tables at the SPU level. Broadcast happens at 2 places, Host broadcast or SPU broadcast (Spu broadcast is also known as redistribute as you can see in the plan file). SPU broadcast is faster than the host broadcast as the host broadcast happens at about 60 mb/sec where as the SPU broadcast is over a gigabit network which is roughly a gig/sec. So when you end up having a lot of host broadcast, you would want to make it a SPU broadcast by changing the factrel planner to something higher like 250k or higher so that most of the queries do the spu boradcast. You should be careful about changing this setting because in some cases you would want the broadcast to be a host broadcast to make the query performance to be better.

    ReplyDelete
  16. -> Find the Table Columns and its structure.

    SELECT attnum,attname,format_type,attcolleng from _V_RELATION_COLUMNS where name='Tablename';

    ReplyDelete