In this article I am going to explain the managed tables vs external tables in Apache Hive. This question is prominently featured in the interviews. So, if you want clarity on this aspect, this article is for you.
So, lets dive into deep in the discussion.
Apache Hive is a warehouse tool in distributed HDFS environment which is used to store huge dataset. You can use data ingestion tools to ingest dataset from variety of platform to Hive warehouse. Apache hive support two type of tables:
1. Managed or Internal table
2. External table
1. Managed or internal table:
When you create a table in Apache hive, by default it is treated as managed or internal table. If you want to create an external table, you will have to use “external” keyword explicitly. By default, Hive stores the managed table in the warehouse folder under hive. In case of managed table, schema (meta data) and the data included in the tables are managed by Hive. You can perform HQL operation such as insert, update, and delete on the managed table. But when you drop managed table, the data along with schema will also get deleted.
For example, I have created a managed table iris_tab_managed in the students database. The see below given snippet:
So, by default iris_tab_managed is created in the warehouse folder under Hive. I have upload data in the managed table (see below given snippet).
So, When I load data in the iris_tab_managed table it will be stored in warehouse folder under studets.db/iris_tab_managed location. The see below given snippet:
When I execute drop table iris_tab_managed command it will drop data as well as meta data of iris_tab_managed table. The see below given snippet.
You can create external table using external keyword and you can also specify location of the dataset in HDFS. See below given snippet.
To confirm whether a table is managed or external you can run following command
>describe formatted iris_tab_managed;
External tables can be stored at any location in the HDFS. Hive only manages meta data of the external table. In case of external table only schema is get deleted but data remains intact in the underlaying location.
From the above discussion, it is super simple to understand that managed table and external table is quite different and used for different purpose. In case of managed table, schema and data is managed by Hive whereas to create external table you need to use external keyword in create table statement and you can also specify location of the dataset. In external table, Hive only manage meta data and data can be stored at any location in HDFS. Managed tables follow ACID properties whereas you cannot perform update and delete operations on external tables. Insert operation is slow in managed operation whereas it is fast in external tables. So, I am sure this stuff will clear your doubts about managed or internal tables VS managed tabled in Apache Hadoop.