ETL build automation using Python
Motivation behind automation at work
In my blogs I pen my experiences as a tech lead and a developer. I have been part of many great projects and have had my fair share of tough ones, but every one of them has been a learning experience. As you read through this blog, and hopefully my other blogs, you will notice that I am penning not only technical thoughts but philosophical ones too.
Often, one of the biggest challenges you face as a technical lead is motivating the troops. Not every project is a flamboyant, challenging piece of work that keeps you engaged from start to end. More often than not you are part of a build project with a routine but large volume of tedious work, and the problem is especially acute in support projects. In the following example I explain the approach we took when we came across such a mundane workload.
Setting
We had a requirement to build an ODS that would house copies of the OLTP entities/tables. The client's BI architects had opinionated recommendations: table structures in the ODS had to mirror those in the OLTP system, and history had to be maintained as a Type 2 Slowly Changing Dimension with dates and a flag. This meant the individual ETL jobs were simple to build, but the number of jobs ballooned to match the number of tables on the OLTP side (approximately 200). On top of that, the team was not trained on the ETL tool being used, SAP Data Services. The plus side of SAP Data Services is that each ETL job can be exported and imported as an XML file. This XML file is well defined, captures all the code details and configurations, and can act as a code artifact that can be promoted across environments.
Automation Solution
The client recommendations were a blessing in disguise. The opinionated and standardized nature of these recommendations (exact column names and data types for Type 2 maintenance) made the task ripe for automation. The team built a Python program that uses source database metadata to generate the ETL jobs programmatically.
Building Blocks of the solution
XML Job Template: The XML template drives the logic and flow of the ETL. We created a working ETL job in which the attribute/column names and transform names start with an identifiable pattern (e.g. Test_). This allowed us to distinguish the elements that need to change from the configurations that remain constant.
Parse Tree: This function reads the XML template using Python's xml.etree.ElementTree module and builds a tree. The tree can be traversed and analyzed very easily with the existing ElementTree functions, as sketched below.
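A minimal sketch of this step, assuming a hypothetical template file name (etl_job_template.xml) and the Test_ placeholder prefix described above; a real SAP Data Services job XML is far richer than what this snippet implies.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER_PREFIX = "Test_"  # identifiable pattern used in the template job

def parse_template(template_path="etl_job_template.xml"):
    """Load the exported job template into an ElementTree."""
    return ET.parse(template_path)

def find_placeholder_elements(tree):
    """Return elements whose 'name' attribute starts with the placeholder prefix."""
    placeholders = []
    for elem in tree.getroot().iter():
        name = elem.get("name", "")
        if name.startswith(PLACEHOLDER_PREFIX):
            placeholders.append(elem)
    return placeholders

if __name__ == "__main__":
    tree = parse_template()
    for elem in find_placeholder_elements(tree):
        print(elem.tag, elem.get("name"))
```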
Metadata: Metadata is captured from the source database: source table names, column names, primary key lists, table indices, data types and other details required to generate the ETL code. This data is loaded into a pandas DataFrame, which makes it easy to query and loop through.
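As an illustration of this step, the sketch below pulls column metadata from INFORMATION_SCHEMA into a pandas DataFrame. The connection string and query are assumptions, since the post does not name the source database, and a real metadata query would also need to cover primary keys and indices.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; the actual source database is not named in the post.
ENGINE = create_engine("postgresql://user:password@source-db:5432/oltp")

METADATA_QUERY = """
SELECT table_name,
       column_name,
       data_type,
       ordinal_position
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position
"""

def load_metadata():
    """Return one row per source column, ready to query per table."""
    return pd.read_sql(METADATA_QUERY, ENGINE)

def columns_for(metadata_df, table_name):
    """All column rows for a single source table, in column order."""
    return metadata_df[metadata_df["table_name"] == table_name]
```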
Generate ETL Code: This is the main program. It takes a source table name as input, reads the metadata frame for that table, loops through the element tree to find the transformations that need to be updated, and replaces the test configurations with the real database attributes. Finally, the updated ElementTree is exported to an XML job file that can be imported straight into the tool. A simplified sketch follows.
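Putting the pieces together, here is a simplified, hypothetical sketch of the generation step. It assumes the helpers above and a naive one-placeholder-per-column mapping; the real program has to honour the full SAP Data Services job structure, which is considerably more involved.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER_PREFIX = "Test_"

def generate_etl_job(table_name, metadata_df, template_path="etl_job_template.xml"):
    """Generate one ETL job XML file for the given source table."""
    tree = ET.parse(template_path)
    table_meta = metadata_df[metadata_df["table_name"] == table_name]
    real_columns = list(table_meta["column_name"])

    # Walk every element and swap placeholder names for real attributes.
    # Placeholders are paired with columns in order here; the real mapping
    # depends on how the template job was laid out.
    column_iter = iter(real_columns)
    for elem in tree.getroot().iter():
        name = elem.get("name", "")
        if name.startswith(PLACEHOLDER_PREFIX):
            elem.set("name", next(column_iter, name))

    out_file = f"{table_name}_job.xml"
    tree.write(out_file, encoding="utf-8", xml_declaration=True)
    return out_file
```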
Benefits
Tangible Benefits:
Manual effort to build one SAP BODS job: ~3 hours (roughly 1,100 person hours for 365 jobs). Effort with automation: ~5 minutes per job (about 35 hours total). Savings of 1,000+ hours.
Intangible Benefits:
Code was built with a close-to-zero defect ratio. Faster time to market for enhancements, as any source table change simply requires regenerating the XML file.
A massive reputation boost with the client and, most importantly, a great morale boost for the team, since the team no longer had to sit through mundane ETL development work.