Query Execution in Eon Mode

Posted February 7, 2018 by Soniya Shah, Information Developer

This blog post was authored by Ben Vandiver.

How Vertica distributes query processing across the cluster in Eon mode is a complex topic that is best illustrated through a concrete example. In this post, we’ll start with a simple data load and walk through metadata storage and query execution. To begin, we need a table and some data:

=> CREATE TABLE T (a int) SEGMENTED BY hash(a) all nodes;
\! seq 1 1000 | /opt/vertica/bin/vsql -c "COPY T FROM STDIN DIRECT"

To see how the database organizes the storage resulting from the load, we look at the storage_containers table:

=> SELECT sal_storage_id,total_row_count,shard_name,node_name FROM storage_containers ORDER BY 3,4;

                  sal_storage_id                  | total_row_count | shard_name  | node_name
--------------------------------------------------+-----------------+-------------+-----------
 02919bece5be68827aaaf86cd60eb28400b000000001bac5 |             228 | segment0001 | node01
 02919bece5be68827aaaf86cd60eb28400b000000001bac5 |             228 | segment0001 | node02
 021ff4d5dfc925bc18d253c178453a9000c000000001bac5 |             263 | segment0002 | node02
 021ff4d5dfc925bc18d253c178453a9000c000000001bac5 |             263 | segment0002 | node03
 0241ff05ab3785c430fa818f384bc2e100d000000001bac5 |             255 | segment0003 | node03
 0241ff05ab3785c430fa818f384bc2e100d000000001bac5 |             255 | segment0003 | node04
 02806eada7df3163e2757e06a55d5fca00a000000001bac5 |             254 | segment0004 | node01
 02806eada7df3163e2757e06a55d5fca00a000000001bac5 |             254 | segment0004 | node04
(8 rows)

The output contains four distinct containers, each listed twice for the two nodes capable of serving it. Each unique container belongs to a shard – in this case a segment shard that is responsible for a region of the hash “segmented by” space. You can see the shard configuration of the system with:

=> SELECT shard_name,shard_type,lower_hash_bound,upper_hash_bound FROM shards;

 shard_name  | shard_type | lower_hash_bound | upper_hash_bound
-------------+------------+------------------+------------------
 replica     | Replica    |                  |
 segment0001 | Segment    |                0 |       1073741825
 segment0002 | Segment    |       1073741826 |       2147483649
 segment0003 | Segment    |       2147483650 |       3221225473
 segment0004 | Segment    |       3221225474 |       4294967295
(5 rows)

Each node is responsible for serving a subset of the shards – the ones it is subscribed to. You can see the state of node subscriptions with:

=> SELECT shard_name,node_name FROM node_subscriptions ORDER BY 1,2;

 shard_name  | node_name
-------------+-----------
 replica     | node01
 replica     | node02
 replica     | node03
 replica     | node04
 segment0001 | node01
 segment0001 | node02
 segment0002 | node02
 segment0002 | node03
 segment0003 | node03
 segment0003 | node04
 segment0004 | node01
 segment0004 | node04
(12 rows)

By comparing with the containers list above, you can see that the nodes serving a container are those that subscribe to the container’s shard. To answer a query on table T, Vertica must read all four unique containers. However, the database has choices: two nodes know about each container. Each session picks a subscribing node to serve each shard:

=> SELECT shard_name,node_name FROM session_subscriptions WHERE is_participating ORDER BY 1,2;

 shard_name  | node_name
-------------+-----------
 replica     | node01
 replica     | node02
 replica     | node03
 replica     | node04
 segment0001 | node02
 segment0002 | node03
 segment0003 | node04
 segment0004 | node01
(8 rows)

The “replica” shard (used for replicated tables) is served by all nodes, while each segment shard is served by exactly one node. When we run a query, we expect node02 to serve segment0001 and thus container 02919bece5be68827aaaf86cd60eb28400b000000001bac5.
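As a quick cross-check (this query is not part of the original walkthrough, just an illustration built on the same system tables shown above), you can join storage_containers to session_subscriptions. It should return each of the four containers exactly once, paired with the node the current session has chosen to serve that container’s shard:

=> SELECT DISTINCT sc.sal_storage_id, sc.shard_name, ss.node_name
     FROM storage_containers sc
     JOIN session_subscriptions ss              -- nodes this session picked, one per shard
       ON sc.shard_name = ss.shard_name
      AND sc.node_name  = ss.node_name
    WHERE ss.is_participating
    ORDER BY sc.shard_name;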
You can check this with:

=> SELECT count(*) from t;
 count
-------
  1000
(1 row)

=> SELECT storageid,node_name FROM dc_file_reads WHERE transaction_id = current_trans_id() AND statement_id = current_statement() - 1;

                    storageid                     | node_name
--------------------------------------------------+-----------
 021ff4d5dfc925bc18d253c178453a9000c000000001bac5 | node03
 0241ff05ab3785c430fa818f384bc2e100d000000001bac5 | node04
 02806eada7df3163e2757e06a55d5fca00a000000001bac5 | node01
 02919bece5be68827aaaf86cd60eb28400b000000001bac5 | node02
(4 rows)

If you are familiar with Vertica’s notion of a projection, you’re probably wondering how Eon mode handles projections and nodes. When we look at the projections table, we can see that the segment range for table T’s projection matches the shard subscription layout – four ranges assigned, segment0001 served by node02, and so on:

=> SELECT projection_name,anchor_table_name,segment_expression,segment_range FROM projections;

 projection_name | anchor_table_name | segment_expression |                              segment_range
-----------------+-------------------+--------------------+--------------------------------------------------------------------------
 T_b0            | T                 | hash(T.a)          | implicit range: node02[25.0%] node03[25.0%] node04[25.0%] node01[25.0%]
(1 row)

The segmentation for a projection has changed from a static assignment at projection creation time to a dynamic assignment at session initialization. Also note that this is a K=1 fault-tolerant database with no “buddy” projections – the fault tolerance is embedded in the shard subscription mechanism, not the projection mechanism. To see how fault tolerance works, let’s kill node04 and re-run the query:

=> SELECT count(*) from t;
 count
-------
  1000
(1 row)

=> SELECT storageid,node_name FROM dc_file_reads WHERE transaction_id = current_trans_id() AND statement_id = current_statement() - 1;

                    storageid                     | node_name
--------------------------------------------------+-----------
 02919bece5be68827aaaf86cd60eb28400b000000001bac5 | node02
 021ff4d5dfc925bc18d253c178453a9000c000000001bac5 | node03
 0241ff05ab3785c430fa818f384bc2e100d000000001bac5 | node03
 02806eada7df3163e2757e06a55d5fca00a000000001bac5 | node01
(4 rows)

The file previously served by node04 is now served by node03, with no change to the query plan.
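If you want to watch the failover as it happens, one option (not shown in the original example; it relies on the standard nodes system table plus the same session_subscriptions table used above) is to confirm the node state and then re-check the session’s shard assignments. After node04 goes down, segment0003 should be assigned to node03, matching the dc_file_reads output above:

=> SELECT node_name, node_state FROM nodes ORDER BY 1;   -- node04 should report DOWN
=> SELECT shard_name, node_name FROM session_subscriptions
     WHERE is_participating ORDER BY 1,2;                -- segment0003 should now map to node03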

By maintaining the segmentation mechanism even when using separated storage and compute, Vertica retains the ability to perform local joins. To demonstrate this behavior, let’s create another table and look at the segmentation:

=> CREATE TABLE t2 (a int) SEGMENTED BY hash(a) all nodes;
=> SELECT projection_name,anchor_table_name,segment_expression,segment_range FROM projections;

 projection_name | anchor_table_name | segment_expression |                              segment_range
-----------------+-------------------+--------------------+--------------------------------------------------------------------------
 T_b0            | T                 | hash(T.a)          | implicit range: node02[25.0%] node03[25.0%] node04[25.0%] node01[25.0%]
 t2_b0           | t2                | hash(t2.a)         | implicit range: node02[25.0%] node03[25.0%] node04[25.0%] node01[25.0%]

The sharding mechanism is database-wide rather than per projection, ensuring that all segmented projections present the same segmentation layout to the query optimizer. You can see that the segment ranges match, which enables a join between the two tables without a post-scan reshuffle of the records. In the explain plan below, note the absence of (RESEGMENT) in the JOIN operation:

=> EXPLAIN SELECT count(*) FROM t NATURAL JOIN t2;

 Access Path:
 +-GROUPBY NOTHING [Cost: 307, Rows: 1 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: count(*)
 |  Execute on: All Nodes
 | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 305, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
 | |      Join Cond: (T.a = t2.a)
 | |      Execute on: All Nodes
 | | +-- Outer -> STORAGE ACCESS for T [Cost: 151, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
 | | |      Projection: public.T_b0
 | | |      Materialize: T.a
 | | |      Execute on: All Nodes
 | | |      Runtime Filter: (SIP1(MergeJoin): T.a)
 | | +-- Inner -> STORAGE ACCESS for t2 [Cost: 151, Rows: 10K (NO STATISTICS)] (PUSHED GROUPING) (PATH ID: 4)
 | | |      Projection: public.t2_b0
 | | |      Materialize: t2.a
 | | |      Execute on: All Nodes

Eon mode provides the ability to distribute query processing across the cluster and to execute plans in the efficient manner for which Vertica is well known. In a later post, we’ll cover how Vertica achieves load balancing through subscriber selection.
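If you want to see the reshuffle that the shared segmentation layout avoids, here is a small experiment to try (not from the original post; t3 is a hypothetical table): segment a new table on a different column and join it to T on a. Because t3’s rows for a given value of a can land on any shard, the plan should include a resegment or broadcast step on one of the join inputs, the very operation the T/t2 join above did not need:

=> CREATE TABLE t3 (a int, b int) SEGMENTED BY hash(b) ALL NODES;  -- segmented on b, not the join key
=> EXPLAIN SELECT count(*) FROM t JOIN t3 USING (a);               -- expect (RESEGMENT) or a broadcast in the JOIN path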