AWS Pre-Deployment Requirements

For each account below, create a service account (e.g. DataForge@dataforgelabs.com) so infrastructure is not tied to any individual employee.


Decide on Public or Private Endpoint Architecture

Public Endpoints

  • UI/API will be accessible on the public internet, secured with Auth0 for authentication and an SSL certificate for HTTPS
  • On-premises source systems can use the Agent to stream data into the platform without requiring firewall changes or VPN tunneling

Private Endpoints

  • UI will be accessed through a private VM deployed in the DataForge VPC; connections to the VM are made using VPN or Amazon AppStream
  • API is not publicly exposed
  • Agent can only access networks that can be VPC-peered to the DataForge VPC (see the sketch below)
  • A public or private Route 53 hosted zone can be used, depending on DNS settings

Please reach out to the DataForge team for diagrams of both architectures.

Amazon AppStream is not available in the us-east-2 and us-west-1 regions; please take this into consideration when planning a private-facing deployment.
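
For reference, since the Agent in a private deployment can only reach networks peered to the DataForge VPC, a minimal Terraform sketch of such a peering might look like the following (both VPC IDs are placeholders, and route tables and security group rules still need to allow the traffic):

```hcl
# Hypothetical sketch: peer a source-system VPC to the DataForge VPC so the
# Agent can reach private sources. All IDs are placeholders.
resource "aws_vpc_peering_connection" "agent_sources" {
  vpc_id      = "vpc-0123456789abcdef0" # VPC containing the source systems
  peer_vpc_id = "vpc-0fedcba9876543210" # DataForge VPC
  auto_accept = true # only valid for same-account, same-region peering
}
```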


URL Management

Two options exist for URL management:

  1. Create a sub-domain in an existing organization domain (e.g. dataforge.organization.com) and delegate control to Route 53 DNS (see the sketch below).
  2. Pick an available domain and purchase it in AWS (Route 53).

Regardless of the method chosen, the domain must be purchased and validated before a DataForge deployment.
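
For option 1, a minimal Terraform sketch of the delegated sub-domain zone might look like the following (the domain name is a placeholder; the name servers it outputs must be added as an NS record in the parent zone to complete delegation):

```hcl
# Hypothetical sketch: hosted zone for the delegated DataForge sub-domain.
resource "aws_route53_zone" "dataforge" {
  name = "dataforge.organization.com" # placeholder sub-domain
}

# Add these name servers as an NS record for "dataforge" in the parent
# organization.com zone to delegate DNS control to Route 53.
output "dataforge_name_servers" {
  value = aws_route53_zone.dataforge.name_servers
}
```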


SSL Certificate

A valid SSL certificate that the client organization controls is required to perform secure connections and TLS termination for the DataForge websites. Select from the following:

  • Use an existing certificate and define a subdomain allocated to DataForge.
  • Purchase a new SSL certificate for a new domain or subdomain.
  • If a subdomain is used, a certificate can be requested through AWS Certificate Manager (ACM), as sketched below.
  • Third-party certificate authorities such as Digicert.com can also be used.
  • Deployment requires either a wildcard certificate or two single-domain certificates per environment.
  • After the purchase or request is complete, verify ownership of the domain to receive the certificate. This is a requirement for deployment.
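
For the ACM route, a hedged sketch of requesting a wildcard certificate with DNS validation (the domain is a placeholder; issuance completes only after the validation CNAME records exist in the hosted zone, which also satisfies the ownership-verification requirement above):

```hcl
# Hypothetical sketch: request a wildcard certificate from ACM.
resource "aws_acm_certificate" "dataforge" {
  domain_name       = "*.dataforgeplatform.com" # placeholder wildcard domain
  validation_method = "DNS"
}

# The CNAME records to create in the hosted zone; ACM issues the
# certificate once they resolve, proving domain ownership.
output "validation_records" {
  value = aws_acm_certificate.dataforge.domain_validation_options
}
```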

Create a Docker Hub Account

Create a Docker Hub account (free edition, service account recommended). Share the Docker username with the DataForge team during initial deployment.


Sign up for a Databricks E2 Account

Create a Databricks free trial account. This will be the E2 account used to create new workspaces.

A credit card is required, so have a corporate card ready. New accounts receive two weeks of free usage, but access is cut off immediately afterward if no card is on file.


Create a GitHub Account

Create a GitHub account. This will allow for access to the DataForge source code.


Create a Terraform Cloud Account

Create a Terraform Cloud account. This is for infrastructure deployment. The free edition will suffice for DataForge infrastructure.


Create an Auth0 Account

The Auth0 Developer tier is required. Create a service account dedicated to the DataForge deployment team.


Create an AWS Environment

If no AWS environment exists, create an account dedicated to the DataForge team; it must be able to create Active Directory resources.


Share the GitHub Account Username and Email Address with DataForge Team Members and Create the Fork

Share the GitHub username and email address with the DataForge team, who will grant read access to the DataForge repositories. Then follow this guide to fork the repository and set up the Terraform VCS provider.


Sign up for DataForge Subscription and Support

Use the following link to choose a support agreement with DataForge and enter your company and payment information for billing: DataForge Checkout

This agreement provides you access to the DataForge platform, platform version upgrades, and ongoing product support.


Decide on a VPN

If a VPN vendor has not already been chosen, we recommend OpenVPN, which can be deployed into the DataForge environment. A VPN is not necessary for deployment; however, it may be required when using the private-facing infrastructure.


AWS Deployment Parameters

| Variable | Example | Description |
| --- | --- | --- |
| awsRegion | us-west-2 | AWS Region for deployment. As of 11/9/2020, us-west-1 is not supported |
| awsAccessKey | xxx | Access key for the Master Deployment IAM user. Mark as sensitive |
| awsSecretKey | xxx | Secret key for the Master Deployment IAM user. Mark as sensitive |
| environment | dev | Prepended to resources in the environment, e.g. dev, prod |
| client | dataforge | Appended to resources in the environment. Use the company or organization name |
| vpcCidrBlock | 10.1 | Only the first two octets, not the full CIDR block |
| avalibilityZoneA | us-west-2a | First availability zone. Zone names vary by region |
| avalibilityZoneB | us-west-2b | Second availability zone. Zone names vary by region |
| RDSretentionperiod | 7 | Database backup retention period (in days) |
| RDSmasterusername | rap_admin | Database master username. "admin" cannot be used |
| RDSmasterpassword | password123 | Alphanumeric database master password. Mark as sensitive |
| RDSport | 5432 | RDS port |
| TransitiontoAA | 60 | Days before objects transition to S3 Standard-Infrequent Access |
| TransitiontoGLACIER | 360 | Days before objects transition to Amazon S3 Glacier |
| stageUsername | stageuser | Database stage username for metastore access |
| stagePassword | password123 | Alphanumeric database stage password for metastore access. Mark as sensitive |
| imageVersion | 5.2.0 | Platform version |
| dockerUsername | dataforge | Docker Hub service account username |
| dockerPassword | xxx | Docker Hub service account password |
| urlEnvPrefix | dev | Prefix for the environment site URL |
| baseUrl | dataforgeplatform | Base URL of the certificate; sites resolve as https://(urlEnvPrefix)(baseUrl).com. Do not include www, .com, or https://, e.g. "dataforge" |
| usEast1CertURL | *.dataforgeplatform.com | Full certificate name (with wildcards) used for SSL |
| auth0Domain | dataforgeplatform.auth0.com | Domain of the Auth0 account |
| auth0ClientId | xxx | Client ID of the API Explorer Application in Auth0 (generated when the account is created) |
| auth0ClientSecret | xxx | Client secret of the API Explorer Application in Auth0 (generated when the account is created) |
| databricksE2Enabled | yes | Whether the Databricks E2 architecture is used in this environment |
| databricksAccountId | 638396f1-xxxx-xxxx-xxxx-ddf61adc4b06 | Account ID for Databricks E2 |
| databricksAccountUser | user@dataforge.com | Username for the main E2 account user |
| databricksAccountPassword | xxxxxxxxx | Password for the main E2 account user |
| readOnlyUsername | readonly | Username for the Postgres read-only user |
| readOnlyPassword | xxxxx | Password for the Postgres read-only user |
| sparkVersion | 15.4.x-scala2.12 | Default Databricks Spark runtime |
| instanceType | m-fleet.xlarge | Default Databricks cluster instance type |
| miniSparkyAutoTermination | 120 | Auto-termination for mini-sparky, in minutes |
| usageAuth0Secret | xxxx | Auth0 secret for usage collection, provided by the DataForge team during deployment |
| usagePassword | xxxxxxx | Alphanumeric password for the usage user. Should be autogenerated |
| intelligentTiering | Enabled | Whether Intelligent-Tiering is enabled on the datalake S3 bucket |
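
As an illustrative sketch only (variable names are from the table above; values are the example placeholders), these parameters are typically supplied as Terraform variables, with credentials marked sensitive in Terraform Cloud rather than committed to a tfvars file:

```hcl
# terraform.tfvars sketch with placeholder values from the table above.
awsRegion          = "us-west-2"
environment        = "dev"
client             = "dataforge"
vpcCidrBlock       = "10.1"
avalibilityZoneA   = "us-west-2a"
avalibilityZoneB   = "us-west-2b"
RDSretentionperiod = 7
imageVersion       = "5.2.0"

# awsAccessKey, awsSecretKey, RDSmasterpassword, stagePassword, dockerPassword,
# auth0ClientSecret, databricksAccountPassword, and the other secrets should be
# set as sensitive workspace variables in Terraform Cloud, not in this file.
```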

Optional variables for tuning ECS container sizes:

| Variable | Example | Description |
| --- | --- | --- |
| apiCPU | 2048 | CPU value for the API ECS container (CPU units) |
| apiMemory | 4096 | Memory value for the API ECS container (MB) |
| apiDesiredCount | 2 | Number of API containers running behind the load balancer. Adding containers increases API stability |
| coreCPU | 2048 | CPU value for the Core ECS container (CPU units) |
| coreMemory | 4096 | Memory value for the Core ECS container (MB) |
| agentCPU | 1024 | CPU value for the Agent ECS container (CPU units) |
| agentMemory | 2048 | Memory value for the Agent ECS container (MB) |
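
A hedged tfvars sketch of overriding these sizes (values are the examples above; assuming ECS requires valid CPU/memory pairings, e.g. 2048 CPU units pairs with 4096 MB or more):

```hcl
# Optional ECS container sizing overrides (placeholder values).
apiCPU          = 2048 # 2 vCPU
apiMemory       = 4096 # 4 GB
apiDesiredCount = 2    # two API tasks behind the load balancer for stability
```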

Optional variables for existing networking resources — work with the deployment and infrastructure team before using these:

| Variable | Example | Description |
| --- | --- | --- |
| existingVPCId | vpc-051adc9a9b102c39e | VPC Id in AWS |
| existingInternetGatewayId | igw-011f8cdb7ebc48407 | Internet Gateway Id in AWS |
| existingNATGatewayId | nat-0c57db95f410e1d65 | NAT Gateway Id in AWS |
| existingPublicRouteTableId | rtb-0bf8f884ce37b1e9c | Public Route Table Id in AWS |
| existingPrivateRouteTableId | rtb-08805580a1ec35d36 | Private Route Table Id in AWS |
| existingWebAZ1Id | subnet-00cab5cd15a4e2f95 | Subnet in availability zone 1 for the UI |
| existingWebAZ2Id | subnet-00cab5cd15a4e2f97 | Subnet in availability zone 2 for the UI |
| existingAppAZ1Id | subnet-00cab5cd15a4e2f94 | Subnet in availability zone 1 for ECS |
| existingAppAZ2Id | subnet-00cab5cd15a4e2f92 | Subnet in availability zone 2 for ECS |
| existingDbAZ1Id | subnet-00cab5cd15a4e2f91 | Subnet in availability zone 1 for Postgres |
| existingDbAZ2Id | subnet-00cab5cd15a4e2f93 | Subnet in availability zone 2 for Postgres |
| existingDatabricksAZ1Id | subnet-00cab5cd15a4e2f99 | Subnet in availability zone 1 for Databricks |
| existingDatabricksAZ2Id | subnet-00cab5cd15a4e2f96 | Subnet in availability zone 2 for Databricks |
| customVpcBlock | 10.0.0.0/16 | Must cover all addresses in the custom subnets |
| customWebAZ1Block | 10.0.1.0/24 | Minimum 4 addresses |
| customWebAZ2Block | 10.0.4.0/24 | Minimum 4 addresses |
| customAppAZ1Block | 10.0.2.0/24 | Minimum 4 addresses |
| customAppAZ2Block | 10.0.5.0/24 | Minimum 4 addresses |
| customDbAZ1Block | 10.0.3.0/24 | Minimum 4 addresses |
| customDbAZ2Block | 10.0.6.0/24 | Minimum 4 addresses |
| customDatabricksAZ1Block | 10.0.128.0/18 | Minimum 255 addresses |
| customDatabricksAZ2Block | 10.0.192.0/18 | Minimum 255 addresses |
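
As a sketch of how the example blocks above nest inside the VPC block (using Terraform's cidrsubnet function; the layout is illustrative, not the deployment's actual allocation logic):

```hcl
# Deriving the example subnet blocks from customVpcBlock = "10.0.0.0/16".
locals {
  vpc_block = "10.0.0.0/16"

  web_az1        = cidrsubnet(local.vpc_block, 8, 1) # 10.0.1.0/24
  app_az1        = cidrsubnet(local.vpc_block, 8, 2) # 10.0.2.0/24
  db_az1         = cidrsubnet(local.vpc_block, 8, 3) # 10.0.3.0/24
  databricks_az1 = cidrsubnet(local.vpc_block, 2, 2) # 10.0.128.0/18
  databricks_az2 = cidrsubnet(local.vpc_block, 2, 3) # 10.0.192.0/18
}
```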

Required variables for a non-public facing deployment:

| Variable | Example | Description |
| --- | --- | --- |
| publicFacing | no | Set to no to deploy non-public facing resources |
| privateApiName | api.dataforge.test | API URL |
| privateDomainName | dataforge.test | Base URL for the environment |
| privateUIName | dev.dataforge.test | UI URL |

Optional variables for a non-public facing deployment:

| Variable | Example | Description |
| --- | --- | --- |
| privateCertArn | arn:aws:acm:us-east-2:678910112:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ARN of an imported SSL certificate to attach to the HTTPS listener on the internal load balancer. If not set, the Terraform script requests a new certificate |
| privateRoute53ZoneId | Z04XXXXXXXX | Id of the private hosted zone to add Route 53 records to. If not set, the Terraform script creates a new private hosted zone |
| usePublicRoute53 | no | If set to yes, an existing public Route 53 zone is used instead of using or creating a private zone |
| vpnIP | 10.0.0.1/16 | IP range allowed to reach the private UI container |
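
Putting the required and optional private-deployment variables together, a tfvars sketch (values are the placeholders from the tables above):

```hcl
# Non-public facing deployment (placeholder values).
publicFacing      = "no"
privateDomainName = "dataforge.test"
privateApiName    = "api.dataforge.test"
privateUIName     = "dev.dataforge.test"

# Optional: reuse an existing private hosted zone and restrict UI traffic
# to the VPN address range.
privateRoute53ZoneId = "Z04XXXXXXXX"
vpnIP                = "10.0.0.1/16"
```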

Verify the Deployment

Once DataForge is up and running, the Data Integration Example in the Getting Started Guide can be followed to verify that the full DataForge stack is working correctly.