Handling Hard Deletes from the Source System
Dealing with data can feel like walking a tightrope—maintaining accuracy and consistency while avoiding unnecessary losses is a challenge every organization faces. When working with real-time data systems, the way you handle deleted data often determines the integrity and usability of your datasets. This post dives deep into hard and soft deletes, the challenges of managing deleted records during incremental loads, and practical approaches used to address them.
The Difference Between Hard and Soft Deletes
Think of a hard delete as wiping something off your whiteboard permanently. This operation removes data entirely from a database, making it unrecoverable. While this approach reduces data size and improves performance, it poses risks such as unintended data loss or compromised referential integrity unless it’s properly safeguarded or backed up.
Soft deletes, on the other hand, are like crossing something off a list but keeping it for reference. Here, the record stays in the database but is marked as inactive using a “DELETE_FLAG” or similar mechanism. This approach prioritizes data recovery and auditability, making it ideal for businesses that require traceability and a safety net for mistakes.
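To make the distinction concrete, here is a minimal Python sketch contrasting the two operations on an in-memory stand-in for a table (the table name, keys, and values are illustrative, not from the actual project):

```python
# In-memory stand-in for a database table, keyed by primary key.
orders = {
    101: {"item": "laptop", "DELETE_FLAG": "N"},
    102: {"item": "monitor", "DELETE_FLAG": "N"},
}

# Hard delete: the row is removed entirely and cannot be recovered from the table.
del orders[101]

# Soft delete: the row stays, but is marked inactive so queries can filter it out.
orders[102]["DELETE_FLAG"] = "Y"

# Downstream queries see only active rows; the soft-deleted row remains recoverable.
active = {pk: row for pk, row in orders.items() if row["DELETE_FLAG"] == "N"}
print(active)  # {} -- 101 is gone for good, 102 is merely hidden
```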
The Business Problem
Our team at Apps Associates encountered a challenging scenario in a recent project involving Oracle Fusion ERP as the Source System and Azure Data Lake as the Target System. Specifically, we had to identify and handle records deleted in the source during incremental loads.
Here’s where the challenge lies:
- Full Load refreshes the entire data in the Target System daily, overwriting previous records.
- Incremental Load, however, fetches and merges only the changed or new records since the last run date while keeping the unchanged data intact.
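The gap between the two load styles can be seen in a small sketch (sample keys and values are illustrative): after an incremental merge, a record that was hard-deleted in the source still looks active in the target.

```python
# Day 1: a full load copies the source as-is into the target.
source = {1: "A", 2: "B", 3: "C"}
target = dict(source)

# Day 2: record 2 changes and record 3 is hard-deleted in the source.
source[2] = "B-updated"  # changed -> appears in the incremental extract
del source[3]            # hard-deleted -> never appears in the extract

# The incremental load merges only changed/new rows into the target.
incremental_extract = {2: "B-updated"}
target.update(incremental_extract)

print(target)  # {1: 'A', 2: 'B-updated', 3: 'C'} -- record 3 lingers as a ghost row
```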
The issue arose when records were deleted in the Source System. Because an Incremental Load fetches only changed or new records, the deletions were never carried across, leaving stale rows in the Target System and mismatched datasets downstream.
Accurately reflecting deleted records in the Target System is critical to prevent consequences such as compromised data accuracy, compliance risks, and potential downstream errors.
Why Deleted Records Matter
Deleted records are more than just missing rows in a database; left untracked, they can disrupt processes and impact critical business operations. Here’s why they can’t be ignored:
- Data Integrity: Mismatched datasets or orphaned records can create inconsistencies that ripple across interconnected systems.
- Audit and Compliance: Industries often require records to adhere to specific regulations, including tracking of deletions for transparency.
- Data Synchronization: Knowing what’s missing ensures systems mirror real-time data accurately.
- Performance Optimization: Pruning unneeded or outdated data frees up storage and improves query speed.
- Error Recovery: Maintaining soft-delete records enables quick resolutions to erroneous deletions.
The Solution
To handle these scenarios, we proposed a solution involving a “DELETE_FLAG” column in the Target System. This column tracks the status of records as active or deleted. Here are two approaches used to implement this solution effectively:
Approach 1 – Using Merge Statement
This approach utilizes the MERGE capability in Apache Spark (exposed through Delta Lake tables) to accurately flag deleted records.
- Load target data, adding a “DELETE_FLAG” column with an initial value set to “N.”
- Extract only the primary keys from the source data into a Bronze Layer.
- Perform an Upsert/Merge Operation:
  - Match the records between the source and the Target System based on primary keys.
  - Mark unmatched records in the Target System with “DELETE_FLAG = Y.”
- Verify the flagged data in the Silver Layer by comparing it with the source.
- Validate the flagged records to ensure accuracy.
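The steps above can be sketched in plain Python, with dicts and sets standing in for the target table and the Bronze-layer key extract (in Delta Lake itself this maps to a MERGE statement; all names and sample values here are illustrative):

```python
def flag_deletes_merge(target_rows, source_keys):
    """Merge-style pass: match target rows against source primary keys and
    flag the unmatched ones as deleted.

    target_rows: {pk: row_dict}, each row carrying a DELETE_FLAG column.
    source_keys: set of primary keys extracted into the Bronze layer.
    """
    for pk, row in target_rows.items():
        if pk in source_keys:
            row["DELETE_FLAG"] = "N"   # matched: still present in the source
        else:
            row["DELETE_FLAG"] = "Y"   # unmatched: deleted in the source
    return target_rows

target = {
    10: {"amount": 500, "DELETE_FLAG": "N"},
    11: {"amount": 750, "DELETE_FLAG": "N"},
}
bronze_keys = {10}  # key 11 was hard-deleted at the source

flag_deletes_merge(target, bronze_keys)
print(target[11]["DELETE_FLAG"])  # Y
```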
Approach 2 – Using Anti-Join with Merge Statement
An anti-join identifies missing records by returning records in the target data that have no match in the source.
- Load target data with a “DELETE_FLAG” column initialized to “N.”
- Extract primary keys from the source into a Bronze Layer.
- Use an Anti-Join Operation between the source and the target to find non-matching records.
- Update the “DELETE_FLAG” column to “Y” for the unmatched records.
- Revalidate discrepancies by comparing the Primary Keys in the Target System against the source.
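At its core, the anti-join is a set difference on primary keys. A minimal Python sketch of the steps above (table names and keys are illustrative):

```python
def anti_join_keys(target_keys, source_keys):
    """Left anti-join on primary keys: target keys with no match in the source."""
    return set(target_keys) - set(source_keys)

target = {
    10: {"DELETE_FLAG": "N"},
    11: {"DELETE_FLAG": "N"},
    12: {"DELETE_FLAG": "N"},
}
bronze_keys = {10, 12}  # key 11 was hard-deleted at the source

# Flag every target row whose key no longer exists in the source extract.
for pk in anti_join_keys(target, bronze_keys):
    target[pk]["DELETE_FLAG"] = "Y"

print(sorted(pk for pk, row in target.items() if row["DELETE_FLAG"] == "Y"))  # [11]
```

In PySpark, the same set difference is expressed as a `left_anti` join, e.g. `target_df.join(source_df, "pk", "left_anti")`, before applying the flag update.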
Both approaches ensure that deleted records from the Source System are accurately flagged in the Target System, maintaining alignment and correctness.
Business Benefits
These solutions provide a range of practical and strategic advantages:
- Data Recovery: Accidentally deleted records can be easily restored without relying on backups.
- Improved Data Integrity: Helps maintain relationships within datasets by avoiding unintended data loss.
- Error Mitigation: Reduces risks linked to accidental hard deletes by creating a safety net for modifications.
- Transparency & Accountability: Allows tracking of deletions, including when and why they occurred, for better governance.
- Enhanced User Experience: Customers and end users can get mistakenly deleted data back without a lengthy support process.
Wrapping It Up
Handling deleted records effectively is a critical component of robust data practices, particularly during incremental loads. By adding a “DELETE_FLAG” into the Target System and incorporating advanced techniques like merges and anti-joins, your systems can stay synchronized, accurate, and compliant.
At Apps Associates, we pride ourselves on designing data solutions that address complex business problems. The strategy we outlined in this post ensures that organizations can retain control over their data while minimizing risks.
If your organization faces similar challenges and you’re eager to learn more, reach out to our team. We’d love to help turn your data issues into streamlined solutions that drive real business value.