AWS Pre Deployment Requirements¶
For each account below, create a service account (e.g. DataForge@dataforgelabs.com) so infrastructure is not tied to any individual employee.
Decide on Public or Private Endpoint Architecture¶
Public Endpoints
- UI/API will be accessible on the public internet, secured with Auth0 for authentication and an SSL certificate for HTTPS
- On-premise source systems can use the Agent to stream data into the platform without requiring inbound firewall rules or VPN tunneling
Private Endpoints
- UI is accessed through a private VM deployed in the DataForge VPC; connections to the VM are made using a VPN or Amazon AppStream
- API is not publicly exposed
- The Agent can only access networks that can be VPC-peered to the DataForge VPC
- A public or private Route 53 hosted zone can be used, depending on DNS settings
Please reach out to the DataForge team for diagrams of both architectures.
Amazon AppStream is not available in the us-east-2 and us-west-1 regions - please take this into consideration when planning a private-facing deployment.
URL Management¶
Two options exist for URL management:
- Create a sub-domain in an existing organization domain (e.g. dataforge.organization.com) and delegate control to Route 53 DNS.
- Pick an available domain and purchase in AWS.
Whichever option is chosen, the domain must be purchased and validated before a DataForge deployment.
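For the sub-domain option, delegation works by adding NS records for the sub-domain in the parent zone that point at the Route 53 hosted zone's name servers. A sketch with placeholder name-server values (the actual values come from the Route 53 hosted zone created for the sub-domain):

```
; In the parent zone (organization.com), delegate the sub-domain
; to Route 53 by adding its hosted zone's name servers:
dataforge.organization.com.   NS   ns-0123.awsdns-45.com.
dataforge.organization.com.   NS   ns-6789.awsdns-01.net.
dataforge.organization.com.   NS   ns-1011.awsdns-23.org.
dataforge.organization.com.   NS   ns-1213.awsdns-45.co.uk.
```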
SSL Certificate¶
A valid SSL certificate controlled by the client organization is required to terminate secure (HTTPS) connections for the DataForge websites. Select from the following:
- Use an existing certificate and define a subdomain allocated to DataForge.
- Purchase a new SSL certificate for a new domain or subdomain.
- If a subdomain is used, a certificate can be purchased through AWS ACM
- Certificates can also be purchased from a third-party provider such as DigiCert
- Deployment requires either a wildcard certificate or two single-domain certificates per environment.
- After purchase is complete, verify ownership of the domain to receive the certificate. This is a requirement for deployment.
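If purchasing through ACM, a wildcard certificate can be requested with DNS validation; the `usEast1CertURL` deployment parameter below suggests the certificate lives in us-east-1. A hedged command sketch (the domain is a placeholder, and the exact region should be confirmed with the DataForge team):

```shell
# Request a wildcard certificate with DNS validation (placeholder domain).
# ACM returns a CNAME record that must be added to the Route 53 hosted
# zone before the certificate is issued - this completes the ownership
# verification mentioned above.
aws acm request-certificate \
  --domain-name "*.dataforgeplatform.com" \
  --validation-method DNS \
  --region us-east-1
```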
Create a Docker Hub Account¶
Create a Docker Hub account (free edition, service account recommended). Share the Docker username with the DataForge team during initial deployment.
Sign up for a Databricks E2 Account¶
Create a Databricks free trial account. This will be the E2 account used to create new workspaces.
A credit card is required - have a corporate card ready. New accounts get two weeks of free usage, but access is cut off immediately afterward if no card is on file.
Create a GitHub Account¶
Create a GitHub account. This will allow for access to the DataForge source code.
Create a Terraform Cloud Account¶
Create a Terraform Cloud account. This is for infrastructure deployment. The free edition will suffice for DataForge infrastructure.
Create an Auth0 Account¶
The Auth0 Developer tier is required. Create a service account dedicated to the DataForge deployment team.
Create an AWS Environment¶
If no AWS environment exists, create an account dedicated to the DataForge team; it must be able to create Active Directory resources.
Share the GitHub Account Username and Email Address with DataForge Team Members and Create the Fork¶
Share the GitHub username and email with the DataForge team — they'll grant read access to the DataForge repositories. Then follow this guide to fork the repository and set up the Terraform VCS provider.
Sign up for DataForge Subscription and Support¶
Use the following link to choose a support agreement with DataForge and enter your company and payment information for billing: DataForge Checkout
This agreement provides you access to the DataForge platform, platform version upgrades, and ongoing product support.
Decide on a VPN¶
If a VPN vendor has not already been chosen, we recommend OpenVPN, which can be deployed into the DataForge environment. A VPN is not required for deployment, but it may be necessary when using the private-facing infrastructure.
AWS Deployment Parameters¶
| Variable | Example | Description |
|---|---|---|
| awsRegion | us-west-2 | AWS Region for deployment - As of 11/9/2020 us-west-1 is not supported |
| awsAccessKey | xxx | Access Key for Master Deployment IAM user - mark as sensitive |
| awsSecretKey | xxx | Secret Key for Master Deployment IAM user - mark as sensitive |
| environment | dev | Prepended to resource names in the environment, e.g. dev, prod |
| client | dataforge | Appended to resource names in the environment - use the company or organization name |
| vpcCidrBlock | 10.1 | Only the first two octets, not the full CIDR block |
| avalibilityZoneA | us-west-2a | First availability zone - verify it exists in the selected region |
| avalibilityZoneB | us-west-2b | Second availability zone - verify it exists in the selected region |
| RDSretentionperiod | 7 | Database backup retention period (in days) |
| RDSmasterusername | rap_admin | Database master username - admin can't be used |
| RDSmasterpassword | password123 | Alphanumeric Database master password - mark sensitive |
| RDSport | 5432 | RDS port |
| TransitiontoAA | 60 | Days before objects transition to S3 Standard-Infrequent Access |
| TransitiontoGLACIER | 360 | Days before objects transition to Amazon S3 Glacier |
| stageUsername | stageuser | Database stage username for metastore access |
| stagePassword | password123 | Alphanumeric database stage password for metastore access - mark sensitive |
| imageVersion | 5.2.0 | Platform version |
| dockerUsername | dataforge | DockerHub service account username |
| dockerPassword | xxx | DockerHub service account password |
| urlEnvPrefix | dev | Prefix for environment site url |
| baseUrl | dataforgeplatform | Base of the site URL, excluding www., the TLD, and https:// (e.g. "dataforgeplatform") - the full URL becomes https://(urlEnvPrefix)(baseUrl).com |
| usEast1CertURL | *.dataforgeplatform.com | Full certificate name (with wildcards) used for SSL |
| auth0Domain | dataforgeplatform.auth0.com | Domain of Auth0 account |
| auth0ClientId | xxx | Client ID of API Explorer Application in Auth0 (needs to be generated when account is created) |
| auth0ClientSecret | xxx | Client Secret of API Explorer Application in Auth0 (needs to be generated when account is created) |
| databricksE2Enabled | yes | Is Databricks E2 architecture being used in this environment? |
| databricksAccountId | 638396f1-xxxx-xxxx-xxxx-ddf61adc4b06 | Account ID for Databricks E2 |
| databricksAccountUser | user@dataforge.com | Username for main E2 account user |
| databricksAccountPassword | xxxxxxxxx | Password for main E2 account user |
| readOnlyUsername | readonly | Username for Postgres read only user |
| readOnlyPassword | xxxxx | Password for Postgres read only user |
| sparkVersion | 15.4.x-scala2.12 | Default Databricks spark runtime |
| instanceType | m-fleet.xlarge | Default Databricks cluster Instance type |
| miniSparkyAutoTermination | 120 | Auto termination for mini-sparky in minutes |
| usageAuth0Secret | xxxx | Auth0 secret for usage collection - provided by DataForge team during deployment |
| usagePassword | xxxxxxx | Alphanumeric password for usage user, should be autogenerated |
| intelligentTiering | Enabled | Intelligent Tiering on datalake S3 bucket enabled or disabled |
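As an illustration only, the parameters above might be entered in a Terraform Cloud workspace or a `terraform.tfvars` file along these lines (values are placeholders taken from the table; sensitive values such as `awsSecretKey` and `RDSmasterpassword` should be entered in Terraform Cloud and marked sensitive, never committed to version control):

```hcl
# Illustrative fragment - not a complete variable set
awsRegion          = "us-west-2"
environment        = "dev"
client             = "dataforge"
vpcCidrBlock       = "10.1"
avalibilityZoneA   = "us-west-2a"
avalibilityZoneB   = "us-west-2b"
RDSretentionperiod = 7
imageVersion       = "5.2.0"
```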
Optional variables for tuning ECS container sizes:
| Variable | Example | Description |
|---|---|---|
| apiCPU | 2048 | CPU value for API ECS container (Units) |
| apiMemory | 4096 | Memory value for API ECS container (MB) |
| apiDesiredCount | 2 | Number of API containers running behind the Load Balancer. Adding containers increases API stability |
| coreCPU | 2048 | CPU value for Core ECS container (Units) |
| coreMemory | 4096 | Memory value for Core ECS container (MB) |
| agentCPU | 1024 | CPU value for Agent ECS container (Units) |
| agentMemory | 2048 | Memory value for Agent ECS container (MB) |
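ECS only accepts certain CPU/memory pairings, so tuned values should be checked before deployment. Assuming the containers run on Fargate (which the CPU-unit/MB sizing suggests), a small sanity-check sketch - this is not part of the DataForge tooling:

```python
# Supported Fargate CPU (units) -> memory (MB) combinations, per AWS
# Fargate task sizing. Memory increments are 1024 MB except at 256 CPU.
FARGATE_COMBOS = {
    256:  [512, 1024, 2048],
    512:  list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

def is_valid_fargate_size(cpu: int, memory_mb: int) -> bool:
    """True if the CPU/memory pair is a supported Fargate combination."""
    return memory_mb in FARGATE_COMBOS.get(cpu, [])
```

For example, the table defaults `apiCPU=2048` / `apiMemory=4096` form a valid pair, while 2048 CPU with 2048 MB would be rejected by ECS.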
Optional variables for existing networking resources — work with the deployment and infrastructure team before using these:
| Variable | Example | Description |
|---|---|---|
| existingVPCId | vpc-051adc9a9b102c39e | VPC Id in AWS |
| existingInternetGatewayId | igw-011f8cdb7ebc48407 | Internet Gateway Id in AWS |
| existingNATGatewayId | nat-0c57db95f410e1d65 | NAT Gateway Id in AWS |
| existingPublicRouteTableId | rtb-0bf8f884ce37b1e9c | Public Route Table Id in AWS |
| existingPrivateRouteTableId | rtb-08805580a1ec35d36 | Private Route Table Id in AWS |
| existingWebAZ1Id | subnet-00cab5cd15a4e2f95 | Availability Zone 1 for UI |
| existingWebAZ2Id | subnet-00cab5cd15a4e2f97 | Availability Zone 2 for UI |
| existingAppAZ1Id | subnet-00cab5cd15a4e2f94 | Availability Zone 1 for ECS |
| existingAppAZ2Id | subnet-00cab5cd15a4e2f92 | Availability Zone 2 for ECS |
| existingDbAZ1Id | subnet-00cab5cd15a4e2f91 | Availability Zone 1 for Postgres |
| existingDbAZ2Id | subnet-00cab5cd15a4e2f93 | Availability Zone 2 for Postgres |
| existingDatabricksAZ1Id | subnet-00cab5cd15a4e2f99 | AZ1 for Databricks |
| existingDatabricksAZ2Id | subnet-00cab5cd15a4e2f96 | AZ2 for Databricks |
| customVpcBlock | 10.0.0.0/16 | Needs to cover all addresses in the custom subnets |
| customWebAZ1Block | 10.0.1.0/24 | Minimum 4 addresses |
| customWebAZ2Block | 10.0.4.0/24 | Minimum 4 addresses |
| customAppAZ1Block | 10.0.2.0/24 | Minimum 4 addresses |
| customAppAZ2Block | 10.0.5.0/24 | Minimum 4 addresses |
| customDbAZ1Block | 10.0.3.0/24 | Minimum 4 addresses |
| customDbAZ2Block | 10.0.6.0/24 | Minimum 4 addresses |
| customDatabricksAZ1Block | 10.0.128.0/18 | Minimum 255 addresses |
| customDatabricksAZ2Block | 10.0.192.0/18 | Minimum 255 addresses |
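Since every custom subnet must fall inside `customVpcBlock` and meet the minimum address counts above, the values can be sanity-checked with the Python standard library before handing them to the deployment team. A sketch using the example values from the table:

```python
import ipaddress

# VPC block from the customVpcBlock example above
vpc = ipaddress.ip_network("10.0.0.0/16")

def check(block: str, min_addresses: int) -> bool:
    """True if the subnet fits inside the VPC block and is large enough."""
    net = ipaddress.ip_network(block)
    return net.subnet_of(vpc) and net.num_addresses >= min_addresses

# e.g. customWebAZ1Block needs at least 4 addresses,
# customDatabricksAZ1Block at least 255
ok = check("10.0.1.0/24", 4) and check("10.0.128.0/18", 255)
```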
Required variables for a non-public facing deployment:
| Variable | Example | Description |
|---|---|---|
| publicFacing | no | Triggers the infrastructure to deploy non-public facing resources |
| privateApiName | api.dataforge.test | API url |
| privateDomainName | dataforge.test | Base url for the environment |
| privateUIName | dev.dataforge.test | UI url |
Optional variables for a non-public facing deployment:
| Variable | Example | Description |
|---|---|---|
| privateCertArn | arn:aws:acm:us-east-2:678910112:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ARN to an imported SSL certificate that will be attached to the HTTPS listener on the internal load balancer. If this variable is not added, a new certificate will be requested by the Terraform script. |
| privateRoute53ZoneId | Z04XXXXXXXX | Id for private hosted zone to add route 53 records to. If this variable is not added, a new private hosted zone will be created by the Terraform script. |
| usePublicRoute53 | no | If set to yes, an existing public route 53 zone will be used instead of using/creating a private zone. |
| vpnIP | 10.0.0.1/16 | IP range to whitelist traffic to the private UI container |
Verify the deployment¶
Once DataForge is up and running, the Data Integration Example in the Getting Started Guide can be followed to verify that the full DataForge stack is working correctly.