This is the 3rd post of Hive Training and in this we shall deep dive into some more concepts of Apache Hive.
Creating Views on Tables
Just like in relational database, we can create views in Hive.
CREATE VIEW IF NOT EXISTS emp_view AS SELECT name, salary FROM employee WHERE salary =10000;
We can also create a view on complex queries, like a join condition of two tables:
CREATE VIEW IF NOT EXISTS emp_dept_view AS SELECT * FROM employee emp JOIN department dep ON (emp.deptid = dep.deptid) WHERE emp.salary = 10000;
To drop a view, we can run simple command like:
DROP VIEW IF EXISTS emp_dept_view;
How to find all views on a table in Hive
This is a tricky question and is often asked in interviews (at least I do ask ;))
The answer is that there is no direct way to do this. First, we need to remember that either we create the view in the same database(aka schema) as the base table or otherwise create a new database just for holding views.
Now, that we know which database has the view we need, use that database:
Then, we need to make sure we name all views similarly, like “view_employee” or “emp_v” etc. Then we query all tables which includes views as well like:
SHOW TABLES LIKE “view*”;
When we find a view and need to see what is its base query etc, we can check by:
DESCRIBE FORMATTED view_employee;
As highlighted the original query used to create the view is shown.
or, we can run below query:
SHOW CREATE TABLE view_emp;
Indexes help in making our queries execute faster. We create index on certain columns in a table on which we will later base our “where” clause.
Without index, the query would otherwise check the entire partition or table.
Main disadvantages of Indexes:
- More disk space needed for indexes
- There is a processing cost for building the index
When we create an Index in Hive, what it does it creates a NEW TABLE. In this new table, it stores the Addresses of HDFS Blocks containing the values in the base table columns. This allows for faster query processing.
Steps to create Index:
This creates an index of type "compact".
WITH DEFERRED REBUILD instructs Hive to first create an empty index
ALTER INDEX index1_emp ON employee REBUILD;
This step will fire a MapReduce job and the index creation is now completed.
To find all indexes on a table:
SHOW FORMATTED INDEXES ON employee;
Please note that after an index has been created and later we load new data in base table, Hive will not index new data.
So we need to rebuild the index if we overwrite or add new data to the table.
Index is just another table and gets created in the same schema as the base table.
There are two Types of Indexes in Hive
The difference between compact and bitmap indexes is how they store the mapping from values to the rows in which the value occurs.
Compact: stores values, hdfs block ids as pairs
Bitmap: stores the same as bitmap image
Hive chooses its own name for index table, but optionally we can pass “IN TABLE our_index_table”. So new index table will have our name as we specify and can make it easier to rebuild index later.
Why indexes are not suggested in Hive
The Hive Query Execution Engine is rule based and not cost based. This means, that it may not use our index, and we would have to manually tweak our query to use the index.
1. Inner Join
These are also known as direct join and are often used when we need a direct match between two columns of different tables. Here we look at “self join” on same table:
SELECT a.id, b.name FROM employees a JOIN employees b ON (a.id = b.id) WHERE A.ID > 5 AND B.NAME = 'AMIT';
Above is an example of self join, which is implemented using inner joins. (The example above is a dummy one and doesn’t really show the full usage of self join).
2. Joining More Than 2 Tables
SELECT a.name, b.dept_name, c.salary_group FROM employee a JOIN department b ON (a.id = b.id AND a.dept = b.dept) JOIN salary C ON (a.salary = c.salary) WHERE a.id> 10 and c.salary = 10000
In the above case, Hive will use a MapReduce job for each table join. It will start from left, that is join a to b and then to c. It doesn’t go from right to left like Oracle.
What Hive does is that it assumes the last table we are specifying (c in this case) is a larger table than the other two. It will try to buffer the data of tables a and b. Therefore, always try to put largest table at the end only.
3. OUTER JOINS
These type of joins are useful where one of the tables has missing data on joining column.
a) LEFT OUTER JOIN
If the right side table has missing data in joining column, even those rows are returned from left side table. Example:
SELECT a.id, a.name, b.dept_name FROM employee a LEFT OUTER JOIN department b ON (a.deptid = b.deptid AND a.salary = b.salary) WHERE a.salary =10000;
b) Right Outer join
Works similarly but when we need all data from right side irrespective of if matching records exists in left side.
c) FULL OUTER JOIN
Data from both sides appear even when there is no match either side.
WHERE clauses are evaluated after joins are performed, so WHERE clauses should use predicates that only filter on column values that won’t be NULL.
d) Inner Query
Select * from employee a where a.deptid IN (select b.deptid from department b);
Above is a WRONG query and doesn’t work in Hive. Instead we need to use: “LEFT SEMI JOIN”
LEFT SEMI JOIN
SELECT * FROM employee a LEFT SEMI JOIN department b ON a.deptid = s.deptid; Note> we can't Select or use in Where clause the columns from the right hand table, department.
e) Cartesian Product
SELECT * FROM employee a JOIN department b;
The above will take a long time to run if dataset is huge. To stop users from issuing cartesian join queries:
hive> set hive.mapred.mode=strict
Nested Select Queries
> FROM ( SELECT name, age, salary-tax AS net_salary FROM employee) a SELECT a.name, a.net_salary WHERE a.age > 50;