Hive Optimization Techniques
Introduction:
Hive is like SQL software used with Hadoop to give users the capability of performing SQL-like queries on its own language, HiveQL, quickly and efficiently. It also gives query analytical abilities to users which are not available on traditional SQL structures. There are several types of Hive Optimization techniques available while running our hive queries to improve Hive performance.
Query Optimization Techniques in Hive:
- Tez Execution Engine
- Hive Partitioning
- Hive Bucketing
- Vectorization
- Map join
- File Formats in Hive
- Cost-Based Optimization
Tez Execution Engine:
Tez is a new application framework built on Hadoop Yarn which executes complex directed acyclic graphs of general data processing tasks. By using Tez as our execution engine one can increase the performance of Hive. However, we can consider it to be a much more flexible and powerful successor to the map-reduce framework.
Hive Partitioning:
Hive reads all the data in the form of directory without partitioning. Further, it applies the query filters on all data which causes slowness and also expensive. Hive partitioning is an effective method to improve the query performance on larger tables. Partitioning allows you to store data in separate sub-directories under table location. It greatly helps the queries which are queried upon the partition key(s). Usually, users need to filter the data on specific column values frequently. Although the selection of partition keys is always a key decision,e.g. if your data is associated with time dimension, then date could be a good partition key. Similarly, if data has association with location, like a country or state, then it’s a good idea to have hierarchical partitions like country/state.
Hive Bucketing:
Bucketing improves the join performance if the bucket key and join keys are common. Bucketing in Hive distributes the data in different buckets based on the hash results on the bucket key. It also reduces the I/O scans during the join process if the process is happening on the same keys (columns).
To enable bucketing, the following flag needs to be set to true before writing data to the bucketed table.
SET hive.enforce.bucketing=true;
To leverage the bucketing in the join operation we should set the following flag to true.
SET hive.optimize.bucketmapjoin=true;
This makes Hive to do bucket level join during the map stage join. It also reduces the scan cycles to find a particular key because bucketing ensures that the key is present in a certain bucket.
Vectorization:
Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. To improve the performance of operations we use Vectorized query execution. Here operations refer to scans, aggregations, filters, and joins. It happens by performing them in batches of 1024 rows at once instead of single row each time. This feature is introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameters settings:
set hive.vectorized.execution = true
set hive.vectorized.execution.enabled = true
Map join:
There is a major issue with Joins. As there is too much activity spending on shuffling data as a result, it slows the Hive Queries. Hence, to speed up the Hive queries, we can use Map Join in Hive. Also, we use Hive Map Side Join when one of the tables in the join is a small table and can be loaded into memory. So that a join could be performed within a mapper without using a Map/Reduce step.
To perform Map Side Join following properties needs to set to true:
a. hive.auto.convert.join = true; By default this option is set to true. When a table with a size less than 25 MB is found, then the joins are converted to map-based joins.
b. hive.auto.convert.join.noconditionaltask= true;
When there comes a scenario while three or more tables are involved in the join condition. Further, Hive generates three or more map-side joins with an assumption that all tables are of smaller size by using hive.auto.convert.join. Moreover, we can combine three or more map-side joins into a single map-side join if the size of the n-1 table is less than 10 MB by using hive.auto.convert.join.noconditionaltask.
Syntax for Mapjoin:
Select /*+ MAPJOIN(b) */ a.key, a.value from a join b on a.key = b.key
Here b is a small table that will fit into memory.
File Formats in Hive:
If we use appropriate file format on the basis of data, it will drastically increase our query performance. Basically, for increasing your query performance ORC file format is best suitable file format. Here, ORC refers to Optimized Row Columnar. That implies we can store data in an optimized way than the other file formats.ORC reduces the size of the original data up to 75%. Hence, data processing speed also increases as compared to Text, Sequence and RC file formats.
Cost-Based Optimization:
Hive optimizes each Query’s logical and physical execution plan. Although, until now these optimizations are not based on the cost of the query but CBO performs further optimizations based on query cost which results as follows: how to order joins, which type of join to perform, the degree of parallelism and others.
To use CBO, need to set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Conclusion:
By using above discussed techniques, we can optimize hive queries for faster execution. This is the basic idea about how to optimize your hive queries in order to get best results in terms of performance.