DataForge SDK for Custom Processing

SDK Overview

The DataForge SDK lets you write your own Python or Scala code in a notebook and attach that notebook to DataForge for automatic processing. The SDK can be used for Custom Ingestions, Custom Parsing, and Custom Post Output processes. Notebooks used for Custom Ingestions and Custom Parsing must return a Spark DataFrame.

Details of the DataForge SDK can be found in the following documentation.

https://github.com/dataforgelabs/dataforge-sdk

Browse the repository directories for your data platform (Databricks or Snowflake) and your preferred language to find the Samples, which show how to start using the SDK.

Development Approach

Below is the recommended approach to developing a new custom notebook for processing.

Databricks

  1. Create an all-purpose cluster in Databricks and attach the DataForge SDK library
  2. Create a new notebook in Databricks and attach the Databricks cluster
  3. Create a compute configuration in DataForge and specify the Databricks notebook path + optional libraries
  4. Create a Source or Output in DataForge and attach the Custom Cluster Configuration
  5. Write your custom code in the Databricks notebook, incorporating any optional SDK parameters
  6. Test the notebook with a new Source data pull (Custom Ingest/Parse) or a new Output (Custom Post Output) to ensure the notebook runs correctly in conjunction with DataForge processes
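The custom code written in step 5 follows one simple contract for ingestion and parsing notebooks: produce rows from the raw input and return them as a Spark DataFrame. The sketch below illustrates only the parsing logic, using the standard library and an in-memory CSV string; in a real Databricks notebook you would pass the parsed rows to spark.createDataFrame and return that DataFrame. The function and column names here are illustrative, not part of the SDK's actual API.

```python
import csv
import io

def parse_raw_csv(raw_text: str) -> list[dict]:
    """Parse raw CSV text into a list of row dictionaries.

    In a real notebook, this list would be handed to
    spark.createDataFrame(rows) so the notebook can return a
    Spark DataFrame, as DataForge expects for Custom Ingestions
    and Custom Parsing.
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    return [dict(row) for row in reader]

# Example raw payload (illustrative only)
raw = "id,name,amount\n1,widget,9.99\n2,gadget,19.50\n"
rows = parse_raw_csv(raw)
# rows -> [{'id': '1', 'name': 'widget', 'amount': '9.99'},
#          {'id': '2', 'name': 'gadget', 'amount': '19.50'}]
```

Keeping the parsing logic in a plain function like this makes step 6 easier: you can unit-test the transformation on sample data before wiring the notebook into a DataForge Source.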

Snowflake

  1. Create a new notebook in Snowflake
  2. Edit the notebook settings for External Access and enable the DataForge External Access Integration (named DATAFORGE_)
  3. Create a Source or Output in DataForge and specify the Snowflake notebook name (double-quote the name if it contains spaces)
  4. Write your custom code in the Snowflake notebook, incorporating any optional SDK parameters
  5. Test the notebook with a new Source data pull (Custom Ingest/Parse) or a new Output (Custom Post Output) to ensure the notebook runs correctly in conjunction with DataForge processes
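With external access enabled (step 2), a common pattern for a custom ingestion notebook is to call an outside API and reshape the JSON response into rows for a Snowpark DataFrame. The sketch below covers only the reshaping step with the standard library, using a hard-coded payload in place of a live HTTP call; the session.create_dataframe usage and the column names mentioned in the comments are assumptions for illustration, not DataForge specifics.

```python
import json

def json_to_rows(payload: str, fields: list[str]) -> list[tuple]:
    """Flatten a JSON array of objects into tuples in a fixed column
    order. In a real Snowflake notebook, the result could be passed to
    session.create_dataframe(rows, schema=fields) (hypothetical usage)
    to build the Snowpark DataFrame the notebook returns.
    """
    records = json.loads(payload)
    return [tuple(rec.get(f) for f in fields) for rec in records]

# Stand-in for a response fetched through the external access integration
payload = '[{"id": 1, "name": "widget"}, {"id": 2, "name": "gadget"}]'
rows = json_to_rows(payload, ["id", "name"])
# rows -> [(1, 'widget'), (2, 'gadget')]
```

As with the Databricks flow, isolating the reshaping logic in a plain function lets you verify it on sample payloads before testing the full notebook against a DataForge Source or Output in step 5.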