DataForge SDK for Custom Processing

SDK Overview

The DataForge SDK lets you write your own Python or Scala code in a notebook and attach that notebook to DataForge for automatic processing. The SDK can be used for Custom Ingestions, Custom Parsing, and Custom Post Output processes. Notebooks used for Custom Ingestions and Custom Parsing must return a Spark DataFrame.

Details of the DataForge SDK can be found in the following documentation.

https://github.com/dataforgelabs/dataforge-sdk

Browse the repository directories for your data platform (Databricks or Snowflake) and your preferred language to find the Samples, which show how to start using the SDK.

Development Approach

Below is the recommended approach to developing a new custom notebook for processing.

Databricks

  1. Create an all-purpose cluster in Databricks and attach the DataForge SDK library
  2. Create a new notebook in Databricks and attach the Databricks cluster
  3. Create a compute configuration in DataForge and specify the Databricks notebook path + optional libraries
  4. Create a Source or Output in DataForge and attach the Custom Cluster Configuration
  5. Write your custom code in the Databricks notebook, incorporating any optional SDK parameters
  6. Test the notebook with a new Source data pull (Custom Ingest/Parse) or a new Output (Custom Post Output) to ensure the notebook runs correctly in conjunction with DataForge processes
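The custom code written in step 5 follows one simple contract for ingestion and parsing notebooks: produce rows from the raw input and return them as a Spark DataFrame. The sketch below illustrates only the parsing logic, using the standard library and an in-memory CSV string; in a real Databricks notebook you would pass the parsed rows to spark.createDataFrame and return that DataFrame. The function and column names here are illustrative, not part of the SDK's actual API.

```python
import csv
import io

def parse_raw_csv(raw_text: str) -> list[dict]:
    """Parse raw CSV text into a list of row dictionaries.

    In a real notebook, this list would be handed to
    spark.createDataFrame(rows) so the notebook can return a
    Spark DataFrame, as DataForge expects for Custom Ingestions
    and Custom Parsing.
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    return [dict(row) for row in reader]

# Example raw payload (illustrative only)
raw = "id,name,amount\n1,widget,9.99\n2,gadget,19.50\n"
rows = parse_raw_csv(raw)
# rows -> [{'id': '1', 'name': 'widget', 'amount': '9.99'},
#          {'id': '2', 'name': 'gadget', 'amount': '19.50'}]
```

Keeping the parsing logic in a plain function like this makes step 6 easier: you can unit-test the transformation on sample data before wiring the notebook into a DataForge Source.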

Snowflake

  1. Create a new notebook in Snowflake
  2. Edit the notebook settings for External Access and enable the DataForge External Access Integration (named DATAFORGE_)
  3. Create a Source or Output in DataForge and specify the Snowflake notebook name (double-quote the name if it contains spaces)
  4. Write your custom code in the Snowflake notebook, incorporating any optional SDK parameters
  5. Test the notebook with a new Source data pull (Custom Ingest/Parse) or a new Output (Custom Post Output) to ensure the notebook runs correctly in conjunction with DataForge processes
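With external access enabled (step 2), a common pattern for a custom ingestion notebook is to call an outside API and reshape the JSON response into rows for a Snowpark DataFrame. The sketch below covers only the reshaping step with the standard library, using a hard-coded payload in place of a live HTTP call; the session.create_dataframe usage and the column names mentioned in the comments are assumptions for illustration, not DataForge specifics.

```python
import json

def json_to_rows(payload: str, fields: list[str]) -> list[tuple]:
    """Flatten a JSON array of objects into tuples in a fixed column
    order. In a real Snowflake notebook, the result could be passed to
    session.create_dataframe(rows, schema=fields) (hypothetical usage)
    to build the Snowpark DataFrame the notebook returns.
    """
    records = json.loads(payload)
    return [tuple(rec.get(f) for f in fields) for rec in records]

# Stand-in for a response fetched through the external access integration
payload = '[{"id": 1, "name": "widget"}, {"id": 2, "name": "gadget"}]'
rows = json_to_rows(payload, ["id", "name"])
# rows -> [(1, 'widget'), (2, 'gadget')]
```

As with the Databricks flow, isolating the reshaping logic in a plain function lets you verify it on sample payloads before testing the full notebook against a DataForge Source or Output in step 5.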