Cleanup Configuration¶
Cleanup Configuration defines retention settings for data lake objects and metadata. Access it via System Configuration → Cleanup Configurations.
A default configuration is created automatically and assigned to all sources.
Cleanup Parameters¶
| Parameter | Description |
|---|---|
| Hub delete type | "Keep latest version only" setting will remove all non-current files and folders with hub table data from data lake "Keep all versions" disables hub data objects cleanup |
| Inputs No Effective Range Period | Retention period for batches (inputs) of data with no effective range (data). Applies to sources with key, timestamp and sequence refresh types |
| Full Refresh Inputs Period | Retention period for not current/latest batches (inputs) of data. Applies to sources with full refresh type |
| Zero Record Inputs Period | Retention period for inputs (batches) containing zero records. Applies to all source refresh types |
| Failed Ingestion Inputs Period | Retention period for inputs that failed Ingestion |
Cleanup deletes inputs from the metadata store per these settings, then removes orphaned data lake objects (deleted inputs and sources).
Configuring Cleanup for the Source¶
Sources are assigned the default cleanup configuration on creation. To change it, open source settings and select a different configuration:
Customizing Cleanup Run Schedule¶
A default "Cleanup" schedule is created in every environment set to run nightly at 12PM UTC. Do not rename this schedule — the process recognizes it by name. To adjust run times, open the Cleanup schedule, update the cron values, and save; changes take effect after the next run or a Core service restart.
DataForge recommends running Cleanup at least once per week to control cloud costs. Sources that haven't had Cleanup run in over 7 days show a garbage can icon in their status. 
Cleanup can also be started manually from System Configuration → Service Configurations.
Customizing Compute Configuration for Cleanup¶
To customize compute for cleanup, open the compute configuration named Cleanup and save your changes. If cleanup is running slowly, switching to Single Node can reduce processing time and costs.
Suggested settings to use for this configuration:
- Name: Cleanup
- Description: Cleanup compute config
- Compute Type: Job
- Scale Mode: Single Node
- Job Task Type: DataForge jar
- Spark Version: 15.4.x-scala2.12
- Node Type: m-fleet.2xlarge (or equivalent type for respective data platform, may want to scale up or down depending on size of environment)
- Enable Elastic Disk: True



