Merge pull request #15 from hbarros-caylent/CA-33
CA-33 adds examples for using ABAC in EMR
hbarros-caylent authored Aug 10, 2021
2 parents e35b47b + 322f867 commit d5926ed
Showing 21 changed files with 388 additions and 160 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,9 @@
# AWS Tamr Config Repo

## v2.2.0 - August 10th 2021
* Adds variable `emr_tags`, which is written into the config file as `TAMR_DATASET_EMR_CLUSTER_TAGS`
* Updates examples to use new versions of modules

## v2.1.0 - July 2nd 2021
* Add HBase properties including `SHARED` mode to default Tamr config

5 changes: 3 additions & 2 deletions README.md
@@ -56,12 +56,13 @@ This module creates:
| emr\_service\_access\_sg\_id | Security group ID of EMR Service Access Security Group. | `string` | `""` | no |
| emr\_service\_role\_name | Name of IAM service role for EMR cluster. | `string` | `""` | no |
| emr\_subnet\_id | ID of the subnet where the EMR cluster will be created. | `string` | `""` | no |
| emr\_tags | Map of tags to add to new resources in EMR | `map(string)` | `{}` | no |
| emrfs\_dynamodb\_table\_name | Name for the EMRFS DynamoDB table. | `string` | `""` | no |
| hbase\_config\_path | Path to HBase configuration in EMR root directory bucket. | `string` | `"config/hbase/conf.dist/"` | no |
| hbase\_namespace | n/a | `string` | `"tamr"` | no |
| hbase\_storage\_mode | Storage mode for HBase. Valid values: `SHARED`, `DEDICATED` | `string` | `"SHARED"` | no |
| hbase\_number\_of\_regions | Number of regions to create by default in HBase | `string` | `"1000"` | no |
| hbase\_number\_of\_salt\_values | Number of distinct salt values to be used for prefixing row keys in HBase tables. Must be >= hbase_number_of_regions | `string` | `"1000"` | no |
| hbase\_number\_of\_salt\_values | Number of distinct salt values to be used for prefixing row keys in HBase tables. Must be >= hbase\_number\_of\_regions | `string` | `"1000"` | no |
| hbase\_storage\_mode | Storage mode for HBase. Valid values: `SHARED`, `DEDICATED` | `string` | `"SHARED"` | no |
| master\_ebs\_size | The master EBS volume size, in gibibytes (GiB). | `string` | `""` | no |
| master\_ebs\_type | Type of volumes to attach to the master nodes. Valid options are gp2, io1, standard and st1. | `string` | `""` | no |
| master\_ebs\_volumes\_count | Number of volumes to attach to the master nodes. | `string` | `""` | no |
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.1.0
2.2.0
3 changes: 3 additions & 0 deletions examples/ephemeral-spark/README.md
@@ -29,9 +29,12 @@ No requirements.
| license\_key | Tamr license key | `string` | n/a | yes |
| rds\_subnet\_group\_ids | List of at least 2 subnet IDs in different AZs | `list(string)` | n/a | yes |
| vpc\_id | VPC ID of deployment | `string` | n/a | yes |
| emr\_abac\_valid\_tags | Valid tags for maintaining resources when using ABAC IAM policies with tag conditions. Make sure `emr_tags` contains the values specified here and that your subnet is tagged as well. | `map(list(string))` | `{}` | no |
| emr\_tags | Map of tags to add to EMR resources. Must include at least the tags specified in `emr_abac_valid_tags`. | `map(string)` | `{}` | no |
| ingress\_cidr\_blocks | List of CIDR blocks from which ingress to ElasticSearch domain, Tamr VM, Tamr Postgres instance are allowed (i.e. VPN CIDR) | `list(string)` | `[]` | no |
| name\_prefix | A prefix to add to the names of all created resources. | `string` | `"tamr-config-test"` | no |
| path\_to\_spark\_logs | Path in logs bucket to store spark logs. E.g. tamr/spark-logs | `string` | `""` | no |
| tags | Map of tags to add to resources. | `map(string)` | `{}` | no |
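
A minimal sketch of how `emr_tags` and `emr_abac_valid_tags` might be set together in a `terraform.tfvars` file. The tag keys and values below are hypothetical; only the variable names and types come from the table above.

```hcl
# terraform.tfvars (hypothetical values, for illustration only)

# Tag keys mapped to the list of values an ABAC policy condition would accept.
emr_abac_valid_tags = {
  "team" = ["data-platform"]
}

# Tags applied to EMR resources. This map repeats the key/value pair above
# so that the resources carry a tag the policy condition accepts.
emr_tags = {
  "team"        = "data-platform"
  "Environment" = "dev"
}
```

Per the `emr_abac_valid_tags` description, the subnet used by the cluster would need the same `team` tag in this scenario.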

## Outputs

39 changes: 26 additions & 13 deletions examples/ephemeral-spark/elasticsearch.tf
@@ -1,5 +1,5 @@
module "tamr-es-cluster" {
source = "git::[email protected]:Datatamer/terraform-aws-es?ref=1.0.1"
source = "git::[email protected]:Datatamer/terraform-aws-es?ref=2.1.0"

# Names
domain_name = "${var.name_prefix}-es"
@@ -13,18 +13,31 @@ module "tamr-es-cluster" {
enforce_https = true

# Networking
vpc_id = var.vpc_id
subnet_ids = [var.ec2_subnet_id]
security_group_ids = [
// Spark
module.ephemeral-spark-sgs.emr_service_access_sg_id,
module.ephemeral-spark-sgs.emr_managed_master_sg_id,
module.ephemeral-spark-sgs.emr_additional_master_sg_id,
module.ephemeral-spark-sgs.emr_managed_core_sg_id,
module.ephemeral-spark-sgs.emr_additional_core_sg_id,
// VM
module.tamr-vm.tamr_security_groups["tamr_security_group_id"]
]
vpc_id = var.vpc_id
subnet_ids = [var.ec2_subnet_id]
security_group_ids = module.aws-sg-es.security_group_ids
# CIDR blocks to allow ingress from (i.e. VPN)
ingress_cidr_blocks = var.ingress_cidr_blocks
aws_region = data.aws_region.current.name
}

data "aws_region" "current" {}

# Security Groups
module "sg-ports-es" {
source = "git::[email protected]:Datatamer/terraform-aws-es.git//modules/es-ports?ref=2.1.0"
}

module "aws-sg-es" {
source = "git::[email protected]:Datatamer/terraform-aws-security-groups.git?ref=1.0.0"
vpc_id = var.vpc_id
ingress_cidr_blocks = var.ingress_cidr_blocks
egress_cidr_blocks = [
"0.0.0.0/0"
]
ingress_ports = module.sg-ports-es.ingress_ports
sg_name_prefix = format("%s-%s", var.name_prefix, "es")
tags = var.tags
ingress_protocol = "tcp"
egress_protocol = "all"
}
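
The security group modules in these examples build `sg_name_prefix` with `format("%s-%s", ...)`. The sketch below shows what that call returns. The prefix value is the `name_prefix` default from the example README, and the `locals` names are hypothetical.

```hcl
locals {
  # Hypothetical locals, used only to show what format() returns.
  example_prefix = "tamr-config-test"

  # "%s-%s" already places one hyphen between the two arguments.
  es_sg_name = format("%s-%s", local.example_prefix, "es") # "tamr-config-test-es"

  # A suffix that carries its own leading hyphen yields a double hyphen.
  es_sg_name_double = format("%s-%s", local.example_prefix, "-es") # "tamr-config-test--es"
}
```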
19 changes: 8 additions & 11 deletions examples/ephemeral-spark/ephemeral-spark.tf
@@ -1,17 +1,13 @@
# Ephemeral Spark cluster
module "ephemeral-spark-sgs" {
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-sgs?ref=3.0.0"
applications = ["Spark"]
vpc_id = var.vpc_id
emr_managed_master_sg_name = "${var.name_prefix}-EMR-Spark-Master"
emr_managed_core_sg_name = "${var.name_prefix}-EMR-Spark-Core"
emr_additional_master_sg_name = "${var.name_prefix}-EMR-Spark-Additional-Master"
emr_additional_core_sg_name = "${var.name_prefix}-EMR-Spark-Additional-Core"
emr_service_access_sg_name = "${var.name_prefix}-EMR-Spark-Service-Access"
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-sgs?ref=6.1.0"
vpc_id = var.vpc_id
emr_managed_sg_name = format("%s-%s", var.name_prefix, "Ephem-Spark-Internal")
tags = merge(var.tags, var.emr_tags)
}

module "ephemeral-spark-iam" {
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-iam?ref=3.0.0"
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-iam?ref=6.1.0"
s3_bucket_name_for_logs = module.s3-logs.bucket_name
s3_bucket_name_for_root_directory = module.s3-data.bucket_name
s3_policy_arns = [
@@ -23,12 +19,13 @@ module "ephemeral-spark-iam" {
emr_service_role_name = "${var.name_prefix}-spark-service-role"
emr_ec2_instance_profile_name = "${var.name_prefix}-spark-emr-instance-profile"
emr_ec2_role_name = "${var.name_prefix}-spark-ec2-role"
tags = var.tags
}

module "ephemeral-spark-config" {
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-config?ref=3.0.0"
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-config?ref=6.1.0"
create_static_cluster = false
cluster_name = "${var.name_prefix}-Spark-Cluster" # unused
cluster_name = "" # unused
emr_config_file_path = "./emr.json"
bucket_name_for_root_directory = module.s3-data.bucket_name
}
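
Several modules in this commit receive `merge(var.tags, var.emr_tags)`. As a sketch of how Terraform's `merge` resolves overlapping keys, with hypothetical maps standing in for the two variables:

```hcl
locals {
  # Hypothetical stand-ins for var.tags and var.emr_tags.
  example_tags     = { Owner = "tamr", Environment = "dev" }
  example_emr_tags = { Environment = "emr-dev", team = "data-platform" }

  # When a key appears in more than one map, the later argument wins,
  # so Environment resolves to "emr-dev" here.
  combined = merge(local.example_tags, local.example_emr_tags)
  # => { Environment = "emr-dev", Owner = "tamr", team = "data-platform" }
}
```

Because `var.emr_tags` is the last argument, its values override any same-named keys in `var.tags`.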
92 changes: 66 additions & 26 deletions examples/ephemeral-spark/hbase-cluster.tf
@@ -1,25 +1,26 @@
locals {
applications = ["Hbase", "Ganglia"]
}
# EMR Static HBase cluster
module "emr-hbase" {
source = "[email protected]:Datatamer/terraform-aws-emr.git?ref=3.0.0"
source = "[email protected]:Datatamer/terraform-aws-emr.git?ref=6.1.0"

# Configurations
create_static_cluster = true
release_label = "emr-5.29.0" # hbase 1.4.10
applications = ["Hbase", "Ganglia"]
applications = local.applications
emr_config_file_path = "./emr.json"
additional_tags = {}
enable_http_port = true
bucket_path_to_logs = "logs/${var.name_prefix}-hbase"
tags = merge(var.tags, var.emr_tags)
abac_valid_tags = var.emr_abac_valid_tags

# Networking
subnet_id = var.ec2_subnet_id
vpc_id = var.vpc_id
tamr_cidrs = var.ingress_cidr_blocks
tamr_sgs = [
module.tamr-vm.tamr_security_groups["tamr_security_group_id"],
module.tamr-es-cluster.es_security_group_id,
module.rds-postgres.rds_sg_id
]
subnet_id = var.ec2_subnet_id
vpc_id = var.vpc_id
# Security Group IDs
emr_managed_master_sg_ids = module.aws-emr-sg-master.security_group_ids
emr_managed_core_sg_ids = module.aws-emr-sg-core.security_group_ids
emr_service_access_sg_ids = module.aws-emr-sg-service-access.security_group_ids

# External resource references
bucket_name_for_root_directory = module.s3-data.bucket_name
@@ -37,20 +38,59 @@ module "emr-hbase" {
emr_ec2_instance_profile_name = "${var.name_prefix}-hbase-emr-instance-profile"
emr_service_iam_policy_name = "${var.name_prefix}-hbase-service-policy"
emr_ec2_iam_policy_name = "${var.name_prefix}-hbase-ec2-policy"
master_instance_group_name = "${var.name_prefix}-HBaseMasterInstanceGroup"
core_instance_group_name = "${var.name_prefix}-HBaseCoreInstanceGroup"
emr_managed_master_sg_name = "${var.name_prefix}-EMR-HBase-Master"
emr_managed_core_sg_name = "${var.name_prefix}-EMR-HBase-Core"
emr_additional_master_sg_name = "${var.name_prefix}-EMR-HBase-Additional-Master"
emr_additional_core_sg_name = "${var.name_prefix}-EMR-HBase-Additional-Core"
emr_service_access_sg_name = "${var.name_prefix}-EMR-HBase-Service-Access"
master_instance_fleet_name = "${var.name_prefix}-HBaseMasterInstanceGroup"
core_instance_fleet_name = "${var.name_prefix}-HBaseCoreInstanceGroup"
emr_managed_sg_name = "${var.name_prefix}-EMR-Managed"

# Scale
master_group_instance_count = 1
core_group_instance_count = 4
master_instance_type = "m4.large"
core_instance_type = "r5.4xlarge"
master_ebs_size = 50
core_ebs_size = 200
core_bid_price = "1.260" # r5.4xlarge on emr -> $1.008 + $0.252 = $1.260
master_instance_on_demand_count = 1
core_instance_on_demand_count = 4
# core_instance_spot_count = 4
# core_bid_price_as_percentage_of_on_demand_price = 100
master_instance_type = "m4.large"
core_instance_type = "r5.xlarge"
master_ebs_size = 50
core_ebs_size = 200
}

module "sg-ports-emr" {
source = "git::[email protected]:Datatamer/terraform-aws-emr.git//modules/aws-emr-ports?ref=6.1.0"

applications = local.applications
}

module "aws-emr-sg-master" {
source = "git::[email protected]:Datatamer/terraform-aws-security-groups.git?ref=1.0.0"
vpc_id = var.vpc_id
ingress_cidr_blocks = var.ingress_cidr_blocks
egress_cidr_blocks = var.egress_cidr_blocks
ingress_ports = module.sg-ports-emr.ingress_master_ports
sg_name_prefix = format("%s-%s", var.name_prefix, "emr-master")
egress_protocol = "all"
ingress_protocol = "tcp"
tags = merge(var.tags, var.emr_tags)
}

module "aws-emr-sg-core" {
source = "git::[email protected]:Datatamer/terraform-aws-security-groups.git?ref=1.0.0"
vpc_id = var.vpc_id
ingress_cidr_blocks = var.ingress_cidr_blocks
egress_cidr_blocks = var.egress_cidr_blocks
ingress_ports = module.sg-ports-emr.ingress_core_ports
sg_name_prefix = format("%s-%s", var.name_prefix, "emr-core")
egress_protocol = "all"
ingress_protocol = "tcp"
tags = merge(var.tags, var.emr_tags)
}

module "aws-emr-sg-service-access" {
source = "git::[email protected]:Datatamer/terraform-aws-security-groups.git?ref=1.0.0"
vpc_id = var.vpc_id
ingress_cidr_blocks = var.ingress_cidr_blocks
egress_cidr_blocks = var.egress_cidr_blocks
ingress_ports = module.sg-ports-emr.ingress_service_access_ports
sg_name_prefix = format("%s-%s", var.name_prefix, "emr-service-access")
egress_protocol = "all"
ingress_protocol = "tcp"
tags = merge(var.tags, var.emr_tags)
}
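
The modules above reference `var.emr_tags`, `var.emr_abac_valid_tags`, and `var.egress_cidr_blocks`, whose declarations are not shown in this diff. A sketch of what they might look like: the first two take their types and defaults from the example README, while the type, description, and default of `egress_cidr_blocks` are assumptions.

```hcl
variable "emr_tags" {
  type        = map(string)
  description = "Map of tags to add to EMR resources."
  default     = {}
}

variable "emr_abac_valid_tags" {
  type        = map(list(string))
  description = "Valid tags for maintaining resources when using ABAC IAM policies with tag conditions."
  default     = {}
}

# Assumption: this variable does not appear in the README excerpt, so the
# type, description, and default below are illustrative only.
variable "egress_cidr_blocks" {
  type        = list(string)
  description = "List of CIDR blocks to which egress is allowed."
  default     = ["0.0.0.0/0"]
}
```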
32 changes: 19 additions & 13 deletions examples/ephemeral-spark/rds.tf
@@ -5,7 +5,7 @@ resource "random_password" "rds-password" {
}

module "rds-postgres" {
source = "git::[email protected]:Datatamer/terraform-aws-rds-postgres.git?ref=1.0.0"
source = "git::[email protected]:Datatamer/terraform-aws-rds-postgres.git?ref=3.0.0"

identifier_prefix = "${var.name_prefix}-"
username = "tamr"
@@ -14,21 +14,27 @@ module "rds-postgres" {
subnet_group_name = "${var.name_prefix}-subnet-group"
postgres_name = "tamr0"
parameter_group_name = "${var.name_prefix}-rds-postgres-pg"
security_group_name = "${var.name_prefix}-sg"

vpc_id = var.vpc_id
# Network requirement: DB subnet group needs a subnet in at least two AZs
rds_subnet_ids = var.rds_subnet_group_ids

ingress_sg_ids = [
// Spark
module.ephemeral-spark-sgs.emr_service_access_sg_id,
module.ephemeral-spark-sgs.emr_managed_master_sg_id,
module.ephemeral-spark-sgs.emr_additional_master_sg_id,
module.ephemeral-spark-sgs.emr_managed_core_sg_id,
module.ephemeral-spark-sgs.emr_additional_core_sg_id,
// Tamr VM
module.tamr-vm.tamr_security_groups["tamr_security_group_id"]
]
additional_cidrs = var.ingress_cidr_blocks
security_group_ids = module.rds-postgres-sg.security_group_ids
tags = var.tags
}

module "sg-ports-rds" {
source = "git::[email protected]:Datatamer/terraform-aws-rds-postgres.git//modules/rds-postgres-ports?ref=3.0.0"
}

module "rds-postgres-sg" {
source = "git::[email protected]:Datatamer/terraform-aws-security-groups.git?ref=1.0.0"
vpc_id = var.vpc_id
ingress_cidr_blocks = var.ingress_cidr_blocks
egress_cidr_blocks = var.egress_cidr_blocks
ingress_ports = module.sg-ports-rds.ingress_ports
sg_name_prefix = var.name_prefix
egress_protocol = "all"
ingress_protocol = "tcp"
tags = var.tags
}
60 changes: 34 additions & 26 deletions examples/ephemeral-spark/tamr-config.tf
@@ -1,5 +1,5 @@
module "tamr-config" {
# source = "git::[email protected]:Datatamer/terraform-aws-tamr-config?ref=2.0.0"
# source = "git::[email protected]:Datatamer/terraform-aws-tamr-config?ref=2.1.0"
source = "../.."

config_template_path = "../../tamr-config.yml"
@@ -8,6 +8,7 @@ module "tamr-config" {
additional_templated_variables = {
"TAMR_LICENSE_KEY" : var.license_key
}
emr_tags = var.emr_tags

# Backup
tamr_backup_emr_cluster_id = module.emr-hbase.tamr_emr_cluster_id
@@ -24,39 +25,46 @@ module "tamr-config" {
tamr_data_bucket = module.s3-data.bucket_name
hbase_config_path = module.emr-hbase.hbase_config_path

# ElasticSearch
es_domain_endpoint = module.tamr-es-cluster.tamr_es_domain_endpoint

# ESP
tamr_external_storage_providers = "[{'name' : 's3a_tamr_config_test','description' : 'The S3a filesystem at root of ${module.s3-data.bucket_name}','uri' : 's3a://${module.s3-data.bucket_name}/'}]"

# Spark
spark_emr_cluster_id = ""
spark_cluster_log_uri = "s3n://${module.s3-logs.bucket_name}/${var.path_to_spark_logs}"
spark_driver_memory = "5G"
spark_executor_instances = 15
spark_executor_memory = "6G"
spark_executor_cores = 1
tamr_data_path = "tamr/unify-data"
tamr_spark_config_override = "[{'name' : 'sparkOverride1','executorInstances' : '2','sparkProps' : {'spark.cores.max' : '4'}},{'name' : 'sparkOverride2','driverMemory' : '4G','executorMemory' : '5G'}]"
tamr_spark_properties_override = "{'spark.driver.maxResultSize':'4g'}"

# ElasticSearch
es_domain_endpoint = module.tamr-es-cluster.tamr_es_domain_endpoint

# ESP
tamr_external_storage_providers = "[{'name' : 's3a_tamr_config_test','description' : 'The S3a filesystem at root of ${module.s3-data.bucket_name}','uri' : 's3a://${module.s3-data.bucket_name}/'}]"

# Ephemeral Spark
emr_release_label = "emr-5.29.0" # spark 2.4.4
emr_instance_profile_name = module.ephemeral-spark-iam.emr_ec2_instance_profile_name
emr_service_role_name = module.ephemeral-spark-iam.emr_service_role_name
emr_key_pair_name = module.emr_key_pair.key_pair_key_name
emr_subnet_id = var.ec2_subnet_id
master_instance_type = "m4.large"
master_ebs_volumes_count = 1
master_ebs_size = 50
master_ebs_type = "gp2"
core_ebs_volumes_count = 1
core_ebs_size = 200
core_ebs_type = "gp2"
core_group_instance_count = 4
core_instance_type = "r5.4xlarge"
emr_service_access_sg_id = module.ephemeral-spark-sgs.emr_service_access_sg_id
emr_managed_master_sg_id = module.ephemeral-spark-sgs.emr_managed_master_sg_id
emr_additional_master_sg_id = module.ephemeral-spark-sgs.emr_additional_master_sg_id
emr_managed_core_sg_id = module.ephemeral-spark-sgs.emr_managed_core_sg_id
emr_additional_core_sg_id = module.ephemeral-spark-sgs.emr_additional_core_sg_id
emr_release_label = "emr-5.29.0" # spark 2.4.4
emr_instance_profile_name = module.ephemeral-spark-iam.emr_ec2_instance_profile_name
emr_service_role_name = module.ephemeral-spark-iam.emr_service_role_name
emr_key_pair_name = module.emr_key_pair.key_pair_key_name
emr_subnet_id = var.ec2_subnet_id
master_instance_type = "m4.large"
master_ebs_volumes_count = 1
master_ebs_size = 50
master_ebs_type = "gp2"
core_ebs_volumes_count = 1
core_ebs_size = 200
core_ebs_type = "gp2"
core_group_instance_count = 4
core_instance_type = "r5.xlarge"

emr_managed_master_sg_id = module.ephemeral-spark-sgs.emr_managed_sg_id
# emr_managed_master_sg_id = "" # you may leave this blank and AWS creates one automatically
emr_additional_master_sg_id = join(",", module.aws-emr-sg-master.security_group_ids)
emr_managed_core_sg_id = module.ephemeral-spark-sgs.emr_managed_sg_id
# emr_managed_core_sg_id = "" # you may leave this blank and AWS creates one automatically
emr_additional_core_sg_id = join(",", module.aws-emr-sg-core.security_group_ids)
emr_service_access_sg_id = ""
}
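
The additional security group inputs above are passed as a single string built with `join(",", ...)`. A minimal sketch of what that expression produces follows. The IDs are hypothetical, and whether the Tamr config template expects a comma-separated value is not confirmed by this diff.

```hcl
locals {
  # Hypothetical IDs standing in for a module's security_group_ids output.
  example_sg_ids = ["sg-0123456789abcdef0", "sg-0fedcba9876543210"]

  # join() concatenates the list elements with the given separator.
  example_joined = join(",", local.example_sg_ids)
  # => "sg-0123456789abcdef0,sg-0fedcba9876543210"
}
```

With a single-element list, the result is just that one ID with no separator.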

# Upload the Tamr configuration to S3