Hive on EMR Serverless

DataLake, Glue, 2024

image

AWS EMR Serverless is magic to launch Hive. I remember spending a week mnay years back to get Hive working on my local machine. This is so easy on AWS, it is cheating!

Pre-setup

Create EMR Access Role

Create IAM role for the EMR using custom trust policies

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "elasticmapreduce.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Attach Policies

  • AmazonElasticMapReduceEditorsRole
  • AmazonS3FullAccess

Create EMR Servlerless Execution Role

Create a Policy for the role: KFNStudyEMRServlerlessExecutionRolePolicies

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAccessForEMRSamples",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*.elasticmapreduce",
                "arn:aws:s3:::*.elasticmapreduce/*"
            ]
        },
        {
            "Sid": "FullAccessToOutputBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::*",
                "arn:aws:s3:::*/*"
            ]
        },
        {
            "Sid": "GlueCreateAndReadDataCatalog",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDataBases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Create IAM role using custom trust policie: KFNStudyEMRServlerlessExecutionRole

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "emr-serverless.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Create a bucket for Hive DB

Bucket Name: emr-hive-serverless-kfn-study Folder: hive/scripts, hive/lake/employee

EMR Servless Studio Setup

I am creating a default Hive application: image

Create DB and Insert Data

-- create database
CREATE DATABASE IF NOT EXISTS emrdb;

-- create table; 
CREATE EXTERNAL TABLE emrdb.employee
    (
    `id` 	INT, 	
    `name` 	STRING,	
     `salary`   STRING
    )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://emr-hive-serverless-kfn-study/inputdata/'
TBLPROPERTIES ('skip.header.line.count'='1')
;

The DB Scripts are saved in a folder which needs to be defined in the job. image

Sample data

Updaded data in the above location: s3://emr-hive-serverless-kfn-study/inputdata/

image

Create a job

  1. Attach the execution role we had created.
  2. Provide the job configuration
{
    "applicationConfiguration": [
        {
            "classification": "hive-site",
            "configurations": [],
            "properties": {
                "hive.exec.scratchdir": "s3://emr-hive-serverless-kfn-study/hive/datalake/employee/scratch",
                "hive.metastore.warehouse.dir": "s3://emr-hive-serverless-kfn-study/hive/datalake/employee/data"
            }
        }
    ],
    "monitoringConfiguration": {}
}

image

Validate Job

Validating the data in the S3. image

The Table is available in AWS Glue image

The data can additionally queried in Athena image