AWS Pre Deployment Requirements¶
For each account below, create a service account (e.g. DataForge@dataforgelabs.com) so infrastructure is not tied to any individual employee.
Decide on Public or Private Endpoint Architecture¶
Public Endpoints
- UI/API will be accessible on the public internet, secured with Auth0 for authentication and an SSL certificate for HTTPS
- On-premise source systems can use the Agent to stream data into the platform without requiring inbound firewall rules or VPN tunneling
Private Endpoints
- UI is accessed through a private VM deployed in the DataForge VPC; connections to the VM are made using a VPN or Amazon AppStream
- API is not publicly exposed
- The Agent can only access networks that can be VPC-peered to the DataForge VPC
- A public or private Route 53 hosted zone can be used, depending on DNS settings
Please reach out to the DataForge team for diagrams of both architectures.
Amazon AppStream is not available in the us-east-2 and us-west-1 regions - please take this into consideration when planning a private-facing deployment.
URL Management¶
Two options exist for URL management:
- Create a sub-domain in an existing organization domain (e.g. dataforge.organization.com) and delegate control to Route 53 DNS.
- Pick an available domain and purchase in AWS.
Whichever option is chosen, the domain must be purchased and validated before a DataForge deployment.
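For the sub-domain option, delegation works by adding NS records for the sub-domain in the parent zone that point at the Route 53 hosted zone's name servers. A sketch with placeholder name-server values (the actual values come from the Route 53 hosted zone created for the sub-domain):

```
; In the parent zone (organization.com), delegate the sub-domain
; to Route 53 by adding its hosted zone's name servers:
dataforge.organization.com.   NS   ns-0123.awsdns-45.com.
dataforge.organization.com.   NS   ns-6789.awsdns-01.net.
dataforge.organization.com.   NS   ns-1011.awsdns-23.org.
dataforge.organization.com.   NS   ns-1213.awsdns-45.co.uk.
```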
SSL Certificate¶
A valid SSL certificate controlled by the client organization is required to terminate secure (HTTPS) connections for the DataForge websites. Select from the following:
- Use an existing certificate and define a subdomain allocated to DataForge.
- Purchase a new SSL certificate for a new domain or subdomain.
- If a subdomain is used, a certificate can be purchased through AWS ACM
- Certificates can also be purchased from a third-party provider such as DigiCert
- Deployment requires either a wildcard certificate or two single-domain certificates per environment.
- After purchase is complete, verify ownership of the domain to receive the certificate. This is a requirement for deployment.
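If purchasing through ACM, a wildcard certificate can be requested with DNS validation; the `usEast1CertURL` deployment parameter below suggests the certificate lives in us-east-1. A hedged command sketch (the domain is a placeholder, and the exact region should be confirmed with the DataForge team):

```shell
# Request a wildcard certificate with DNS validation (placeholder domain).
# ACM returns a CNAME record that must be added to the Route 53 hosted
# zone before the certificate is issued - this completes the ownership
# verification mentioned above.
aws acm request-certificate \
  --domain-name "*.dataforgeplatform.com" \
  --validation-method DNS \
  --region us-east-1
```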
Create a Docker Hub Account¶
Create a Docker Hub account (free edition, service account recommended). Share the Docker username with the DataForge team during initial deployment.
Sign up for a Databricks E2 Account¶
Create a Databricks free trial account. This will be the E2 account used to create new workspaces.
A credit card is required - have a corporate card ready. New accounts get two weeks of free usage, but access is cut off immediately afterward if no card is on file.
Create a GitHub Account¶
Create a GitHub account. This will allow for access to the DataForge source code.
Create a Terraform Cloud Account¶
Create a Terraform Cloud account. This is for infrastructure deployment. The free edition will suffice for DataForge infrastructure.
Create an Auth0 Account¶
The Auth0 Developer tier is required. Create a service account dedicated to the DataForge deployment team.
Create an AWS Environment¶
If no AWS environment exists, create an account dedicated to the DataForge team; it must be able to create Active Directory resources.
Share the GitHub Account Username and Email Address with DataForge Team Members and Create the Fork¶
Share the GitHub username and email with the DataForge team — they'll grant read access to the DataForge repositories. Then follow this guide to fork the repository and set up the Terraform VCS provider.
Sign up for DataForge Subscription and Support¶
Use the following link to choose a support agreement with DataForge and enter your company and payment information for billing: DataForge Checkout
This agreement provides you access to the DataForge platform, platform version upgrades, and ongoing product support.
Decide on a VPN¶
If a VPN vendor has not already been chosen, we recommend OpenVPN, which can be deployed into the DataForge environment. A VPN is not required for deployment, but it may be necessary when using the private-facing infrastructure.
AWS Deployment Parameters¶
| Variable | Example | Description |
|---|---|---|
| awsRegion | us-west-2 | AWS Region for deployment - As of 11/9/2020 us-west-1 is not supported |
| awsAccessKey | xxx | Access Key for Master Deployment IAM user - mark as sensitive |
| awsSecretKey | xxx | Secret Key for Master Deployment IAM user - mark as sensitive |
| environment | dev | Prepended to resource names in the environment, e.g. dev, prod |
| client | dataforge | Appended to resource names in the environment - use the company or organization name |
| vpcCidrBlock | 10.1 | Only the first two octets, not the full CIDR block |
| avalibilityZoneA | us-west-2a | First availability zone - verify it exists in the selected region |
| avalibilityZoneB | us-west-2b | Second availability zone - verify it exists in the selected region |
| RDSretentionperiod | 7 | Database backup retention period (in days) |
| RDSmasterusername | rap_admin | Database master username - admin can't be used |
| RDSmasterpassword | password123 | Alphanumeric Database master password - mark sensitive |
| RDSport | 5432 | RDS port |
| TransitiontoAA | 60 | Days before objects transition to S3 Standard-Infrequent Access |
| TransitiontoGLACIER | 360 | Days before objects transition to Amazon S3 Glacier |
| stageUsername | stageuser | Database stage username for metastore access |
| stagePassword | password123 | Alphanumeric database stage password for metastore access - mark sensitive |
| imageVersion | 5.2.0 | Platform version |
| dockerUsername | dataforge | DockerHub service account username |
| dockerPassword | xxx | DockerHub service account password |
| urlEnvPrefix | dev | Prefix for environment site url |
| baseUrl | dataforgeplatform | Base of the site URL, excluding www., the TLD, and https:// (e.g. "dataforgeplatform") - the full URL becomes https://(urlEnvPrefix)(baseUrl).com |
| usEast1CertURL | *.dataforgeplatform.com | Full certificate name (with wildcards) used for SSL |
| auth0Domain | dataforgeplatform.auth0.com | Domain of Auth0 account |
| auth0ClientId | xxx | Client ID of API Explorer Application in Auth0 (needs to be generated when account is created) |
| auth0ClientSecret | xxx | Client Secret of API Explorer Application in Auth0 (needs to be generated when account is created) |
| databricksE2Enabled | yes | Is Databricks E2 architecture being used in this environment? |
| databricksAccountId | 638396f1-xxxx-xxxx-xxxx-ddf61adc4b06 | Account ID for Databricks E2 |
| databricksAccountUser | user@dataforge.com | Username for main E2 account user |
| databricksAccountPassword | xxxxxxxxx | Password for main E2 account user |
| readOnlyUsername | readonly | Username for Postgres read only user |
| readOnlyPassword | xxxxx | Password for Postgres read only user |
| sparkVersion | 15.4.x-scala2.12 | Default Databricks spark runtime |
| instanceType | m-fleet.xlarge | Default Databricks cluster Instance type |
| miniSparkyAutoTermination | 120 | Auto termination for mini-sparky in minutes |
| usageAuth0Secret | xxxx | Auth0 secret for usage collection - provided by DataForge team during deployment |
| usagePassword | xxxxxxx | Alphanumeric password for usage user, should be autogenerated |
| intelligentTiering | Enabled | Intelligent Tiering on datalake S3 bucket enabled or disabled |
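As an illustration only, the parameters above might be entered in a Terraform Cloud workspace or a `terraform.tfvars` file along these lines (values are placeholders taken from the table; sensitive values such as `awsSecretKey` and `RDSmasterpassword` should be entered in Terraform Cloud and marked sensitive, never committed to version control):

```hcl
# Illustrative fragment - not a complete variable set
awsRegion          = "us-west-2"
environment        = "dev"
client             = "dataforge"
vpcCidrBlock       = "10.1"
avalibilityZoneA   = "us-west-2a"
avalibilityZoneB   = "us-west-2b"
RDSretentionperiod = 7
imageVersion       = "5.2.0"
```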
Optional variables for tuning ECS container sizes:
| Variable | Example | Description |
|---|---|---|
| apiCPU | 2048 | CPU value for API ECS container (Units) |
| apiMemory | 4096 | Memory value for API ECS container (MB) |
| apiDesiredCount | 2 | Number of API containers running behind the Load Balancer. Adding containers increases API stability |
| coreCPU | 2048 | CPU value for Core ECS container (Units) |
| coreMemory | 4096 | Memory value for Core ECS container (MB) |
| agentCPU | 1024 | CPU value for Agent ECS container (Units) |
| agentMemory | 2048 | Memory value for Agent ECS container (MB) |
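ECS only accepts certain CPU/memory pairings, so tuned values should be checked before deployment. Assuming the containers run on Fargate (which the CPU-unit/MB sizing suggests), a small sanity-check sketch - this is not part of the DataForge tooling:

```python
# Supported Fargate CPU (units) -> memory (MB) combinations, per AWS
# Fargate task sizing. Memory increments are 1024 MB except at 256 CPU.
FARGATE_COMBOS = {
    256:  [512, 1024, 2048],
    512:  list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

def is_valid_fargate_size(cpu: int, memory_mb: int) -> bool:
    """True if the CPU/memory pair is a supported Fargate combination."""
    return memory_mb in FARGATE_COMBOS.get(cpu, [])
```

For example, the table defaults `apiCPU=2048` / `apiMemory=4096` form a valid pair, while 2048 CPU with 2048 MB would be rejected by ECS.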
Optional variables for existing networking resources — work with the deployment and infrastructure team before using these:
| Variable | Example | Description |
|---|---|---|
| existingVPCId | vpc-051adc9a9b102c39e | VPC Id in AWS |
| existingInternetGatewayId | igw-011f8cdb7ebc48407 | Internet Gateway Id in AWS |
| existingNATGatewayId | nat-0c57db95f410e1d65 | NAT Gateway Id in AWS |
| existingPublicRouteTableId | rtb-0bf8f884ce37b1e9c | Public Route Table Id in AWS |
| existingPrivateRouteTableId | rtb-08805580a1ec35d36 | Private Route Table Id in AWS |
| existingWebAZ1Id | subnet-00cab5cd15a4e2f95 | Availability Zone 1 for UI |
| existingWebAZ2Id | subnet-00cab5cd15a4e2f97 | Availability Zone 2 for UI |
| existingAppAZ1Id | subnet-00cab5cd15a4e2f94 | Availability Zone 1 for ECS |
| existingAppAZ2Id | subnet-00cab5cd15a4e2f92 | Availability Zone 2 for ECS |
| existingDbAZ1Id | subnet-00cab5cd15a4e2f91 | Availability Zone 1 for Postgres |
| existingDbAZ2Id | subnet-00cab5cd15a4e2f93 | Availability Zone 2 for Postgres |
| existingDatabricksAZ1Id | subnet-00cab5cd15a4e2f99 | AZ1 for Databricks |
| existingDatabricksAZ2Id | subnet-00cab5cd15a4e2f96 | AZ2 for Databricks |
| customVpcBlock | 10.0.0.0/16 | Needs to cover all addresses in the custom subnets |
| customWebAZ1Block | 10.0.1.0/24 | Minimum 4 addresses |
| customWebAZ2Block | 10.0.4.0/24 | Minimum 4 addresses |
| customAppAZ1Block | 10.0.2.0/24 | Minimum 4 addresses |
| customAppAZ2Block | 10.0.5.0/24 | Minimum 4 addresses |
| customDbAZ1Block | 10.0.3.0/24 | Minimum 4 addresses |
| customDbAZ2Block | 10.0.6.0/24 | Minimum 4 addresses |
| customDatabricksAZ1Block | 10.0.128.0/18 | Minimum 255 addresses |
| customDatabricksAZ2Block | 10.0.192.0/18 | Minimum 255 addresses |
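Since every custom subnet must fall inside `customVpcBlock` and meet the minimum address counts above, the values can be sanity-checked with the Python standard library before handing them to the deployment team. A sketch using the example values from the table:

```python
import ipaddress

# VPC block from the customVpcBlock example above
vpc = ipaddress.ip_network("10.0.0.0/16")

def check(block: str, min_addresses: int) -> bool:
    """True if the subnet fits inside the VPC block and is large enough."""
    net = ipaddress.ip_network(block)
    return net.subnet_of(vpc) and net.num_addresses >= min_addresses

# e.g. customWebAZ1Block needs at least 4 addresses,
# customDatabricksAZ1Block at least 255
ok = check("10.0.1.0/24", 4) and check("10.0.128.0/18", 255)
```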
Required variables for a non-public facing deployment:
| Variable | Example | Description |
|---|---|---|
| publicFacing | no | Triggers the infrastructure to deploy non-public facing resources |
| privateApiName | api.dataforge.test | API url |
| privateDomainName | dataforge.test | Base url for the environment |
| privateUIName | dev.dataforge.test | UI url |
Optional variables for a non-public facing deployment:
| Variable | Example | Description |
|---|---|---|
| privateCertArn | arn:aws:acm:us-east-2:678910112:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ARN to an imported SSL certificate that will be attached to the HTTPS listener on the internal load balancer. If this variable is not added, a new certificate will be requested by the Terraform script. |
| privateRoute53ZoneId | Z04XXXXXXXX | Id for private hosted zone to add route 53 records to. If this variable is not added, a new private hosted zone will be created by the Terraform script. |
| usePublicRoute53 | no | If set to yes, an existing public route 53 zone will be used instead of using/creating a private zone. |
| vpnIP | 10.0.0.1/16 | IP range to whitelist traffic to the private UI container |
Verify the deployment¶
Once DataForge is up and running, the Data Integration Example in the Getting Started Guide can be followed to verify that the full DataForge stack is working correctly.