Hive Tutorial -Part III

This is the 3rd post of Hive Training and in this we shall deep dive into some more concepts of Apache Hive.

Creating Views on Tables

Just like in relational database, we can create views in Hive.

CREATE VIEW IF NOT EXISTS emp_view 
AS
SELECT name, salary FROM employee 
WHERE salary =10000;

 

We can also create a view on complex queries, like a join condition of two tables:

CREATE VIEW IF NOT EXISTS emp_dept_view AS
SELECT * FROM 
   employee emp 
JOIN 
   department dep
 ON 
  (emp.deptid = dep.deptid) 
WHERE emp.salary = 10000;

To drop a view, we can run simple command like:

DROP VIEW

DROP VIEW IF EXISTS emp_dept_view;

 

How to find all views on a table in Hive

This is a tricky question and is often asked in interviews (at least I do ask ;))

The answer is that there is no direct way to do this. First, we need to remember that either we create the view in the same database(aka schema) as the base table or otherwise create a new database just for holding views.

Now, that we know which database has the view we need, use that database:

USE Financials;

Then, we need to make sure we name all views similarly, like “view_employee” or “emp_v” etc. Then we query all tables which includes views as well like:

SHOW TABLES LIKE “view*”;

When we find a view and need to see what is its base query etc, we can check by:

DESCRIBE FORMATTED view_employee;

As highlighted the original query used to create the view is shown.

or, we can run below query:

SHOW CREATE TABLE view_emp;

 

INDEXING

Indexes help in making our queries execute faster. We create index on certain columns in a table on which we will later base our “where” clause.

Without index, the query would otherwise check the entire partition or table.

Main disadvantages of Indexes:

  1. More disk space needed for indexes
  2. There is a processing cost for building the index

When we create an Index in Hive, what it does it creates a NEW TABLE. In this new table, it stores the Addresses of HDFS Blocks containing the values in the base table columns. This allows for faster query processing.

Steps to create Index:

This creates an index of type "compact".

WITH DEFERRED REBUILD instructs Hive to first create an empty index

ALTER INDEX index1_emp ON employee REBUILD;

This step will fire a MapReduce job and the index creation is now completed.

To find all indexes on a table:

SHOW FORMATTED INDEXES ON employee;

Please note that after an index has been created and later we load new data in base table, Hive will not index new data.

So we need to rebuild the index if we overwrite or add new data to the table.

Index is just another table and gets created in the same schema as the base table.

There are two Types of Indexes in Hive
Compact Indexing
Bitmap Indexing

The difference between compact and bitmap indexes is how they store the mapping from values to the rows in which the value occurs.
Compact: stores values, hdfs block ids as pairs
Bitmap: stores the same as bitmap image

Hive chooses its own name for index table, but optionally we can pass “IN TABLE our_index_table”. So new index table will have our name as we specify and can make it easier to rebuild index later.

Why indexes are not suggested in Hive

The Hive Query Execution Engine is rule based and not cost based. This means, that it may not use our index, and we would have to manually tweak our query to use the index.

 

JOINS

1. Inner Join

These are also known as direct join and are often used when we need a direct match between two columns of different tables. Here we look at “self join” on same table:

SELECT a.id, b.name 
FROM employees a 
  JOIN
employees b
    ON
(a.id = b.id)
WHERE A.ID > 5
AND B.NAME = 'AMIT';

Above is an example of self join, which is implemented using inner joins. (The example above is a dummy one and doesn’t really show the full usage of self join).

2. Joining More Than 2 Tables

SELECT a.name, b.dept_name, c.salary_group
FROM employee a 
  JOIN 
    department b 
        ON
   (a.id = b.id AND a.dept = b.dept)
  JOIN 
    salary C 
        ON
   (a.salary = c.salary)
WHERE a.id> 10
and c.salary = 10000

In the above case, Hive will use a MapReduce job for each table join. It will start from left, that is join a to b and then to c. It doesn’t go from right to left like Oracle.

What Hive does is that it assumes the last table we are specifying (c in this case) is a larger table than the other two. It will try to buffer the data of tables a and b. Therefore, always try to put largest table at the end only.

 

3. OUTER JOINS

These type of joins are useful where one of the tables has missing data on joining column.

a) LEFT OUTER JOIN
If the right side table has missing data in joining column, even those rows are returned from left side table. Example:

SELECT a.id, a.name, b.dept_name
FROM employee a 
LEFT OUTER JOIN department b 
ON (a.deptid = b.deptid AND a.salary = b.salary)
WHERE a.salary =10000;

b) Right Outer join

Works similarly but when we need all data from right side irrespective of if matching records exists in left side.

c)  FULL OUTER JOIN

Data from both sides appear even when there is no match either side.

WARNING
WHERE clauses are evaluated after joins are performed, so WHERE clauses should use predicates that only filter on column values that won’t be NULL.

 

d) Inner Query

Select * from employee a 
where a.deptid 
   IN
(select b.deptid from department b);

Above is a WRONG query and doesn’t work in Hive. Instead we need to use: “LEFT SEMI JOIN

    LEFT SEMI JOIN

SELECT * FROM employee a
LEFT SEMI JOIN 
department b 
   ON
a.deptid = s.deptid;


Note> we can't Select or use in Where clause the columns from the right hand table, department.

 

e) Cartesian Product

SELECT * FROM employee a JOIN department b;

 

The above will take a long time to run if dataset is huge. To stop users from issuing cartesian join queries:

hive> set hive.mapred.mode=strict

 

Nested Select Queries

> FROM (
     SELECT name, age, salary-tax AS net_salary
     FROM employee) a
 SELECT a.name, a.net_salary
 WHERE a.age > 50;

 

Leave a Reply

Your email address will not be published. Required fields are marked *