Hive on EMR Serverless
DataLake, Glue, 2024
AWS EMR Serverless is magic to launch Hive. I remember spending a week mnay years back to get Hive working on my local machine. This is so easy on AWS, it is cheating!
Pre-setup
Create EMR Access Role
Create IAM role for the EMR using custom trust policies
{
"Version": "2008-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": "elasticmapreduce.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Attach Policies
- AmazonElasticMapReduceEditorsRole
- AmazonS3FullAccess
Create EMR Servlerless Execution Role
Create a Policy for the role: KFNStudyEMRServlerlessExecutionRolePolicies
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadAccessForEMRSamples",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::*.elasticmapreduce",
"arn:aws:s3:::*.elasticmapreduce/*"
]
},
{
"Sid": "FullAccessToOutputBucket",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::*",
"arn:aws:s3:::*/*"
]
},
{
"Sid": "GlueCreateAndReadDataCatalog",
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase",
"glue:GetDataBases",
"glue:CreateTable",
"glue:GetTable",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:GetTables",
"glue:GetPartition",
"glue:GetPartitions",
"glue:CreatePartition",
"glue:BatchCreatePartition",
"glue:GetUserDefinedFunctions"
],
"Resource": [
"*"
]
}
]
}
Create IAM role using custom trust policie: KFNStudyEMRServlerlessExecutionRole
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "emr-serverless.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Create a bucket for Hive DB
Bucket Name: emr-hive-serverless-kfn-study Folder: hive/scripts, hive/lake/employee
EMR Servless Studio Setup
I am creating a default Hive application:
Create DB and Insert Data
-- create database
CREATE DATABASE IF NOT EXISTS emrdb;
-- create table;
CREATE EXTERNAL TABLE emrdb.employee
(
`id` INT,
`name` STRING,
`salary` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://emr-hive-serverless-kfn-study/inputdata/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The DB Scripts are saved in a folder which needs to be defined in the job.
Sample data
Updaded data in the above location: s3://emr-hive-serverless-kfn-study/inputdata/
Create a job
- Attach the execution role we had created.
- Provide the job configuration
{
"applicationConfiguration": [
{
"classification": "hive-site",
"configurations": [],
"properties": {
"hive.exec.scratchdir": "s3://emr-hive-serverless-kfn-study/hive/datalake/employee/scratch",
"hive.metastore.warehouse.dir": "s3://emr-hive-serverless-kfn-study/hive/datalake/employee/data"
}
}
],
"monitoringConfiguration": {}
}
Validate Job
Validating the data in the S3.
The Table is available in AWS Glue
The data can additionally queried in Athena