Data Engineering Project Initiation Checklist
Date:
Some upfront work is required to ensure the success of data engineering projects. I have used this checklist to provide a framework for collaborating with multiple stakeholders to define clear requirements and designs.
Acquisition
- Are the data sources external to the cloud? If so, we need an ingestion design for bringing compressed/encrypted data in from external sources, which adds complexity to the overall solution.
- Regardless of the extraction pattern (delta or full load), it is recommended to use a data file landing stage on cloud storage for roll-back, archival (raw data), and audit purposes (see the landing-zone sketch after this list).
- Establish standard tools and SLAs with the extraction/source teams; these agreements significantly impact timelines and the quality of initial deliveries.
- Is data encrypted in flight and at rest? Is the security architecture for acquiring the data defined, with a regular audit process to ensure compliance?
- Establish the frequency of data extraction from sources to the system based on entity type, reporting needs, and source system constraints. This should be handled on a case-by-case basis as one size does not fit all.
- Can delta loads be extracted from the source system (see the incremental-extraction sketch after this list)?
- When starting out, extract only the data that is actually needed from source systems to keep scope manageable. Do we have a list of the tables that need extraction?
- What format (type, compression, size, etc.) should the extracted data be in?
- How much reliability needs to be built into the extraction process?
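As referenced above, here is a minimal sketch of a landing-stage layout, assuming an S3-style object store; the bucket name, prefixes, and use of boto3 are illustrative assumptions, not part of the checklist. The point is an immutable, date-partitioned raw layout that supports roll-back, archival, and audit.

```python
from datetime import datetime, timezone
import boto3  # assumption: AWS object storage; any cloud object store works the same way

LANDING_BUCKET = "my-landing-zone"  # hypothetical bucket name

def landing_key(source: str, table: str, extracted_at: datetime, filename: str) -> str:
    """Build an immutable, date-partitioned key so raw files can be
    replayed (roll-back), retained (archival), and traced (audit)."""
    return (
        f"raw/{source}/{table}/"
        f"year={extracted_at:%Y}/month={extracted_at:%m}/day={extracted_at:%d}/"
        f"{filename}"
    )

def land_file(source: str, table: str, payload: bytes, filename: str) -> str:
    """Write one extracted file to the landing stage and return its key."""
    key = landing_key(source, table, datetime.now(timezone.utc), filename)
    boto3.client("s3").put_object(Bucket=LANDING_BUCKET, Key=key, Body=payload)
    return key
```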
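Where the source supports delta loads, a common pattern is watermark-based incremental extraction: pull only the rows changed since the last successful run. A minimal sketch, assuming the source table has an `updated_at` column and using a local file as the watermark store (both assumptions; a metadata table is more typical in practice).

```python
import sqlite3  # stand-in for the real source connection; any DB-API driver works the same way

WATERMARK_FILE = "last_watermark.txt"  # hypothetical state store

def read_watermark(default: str = "1970-01-01 00:00:00") -> str:
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return default

def extract_delta(conn: sqlite3.Connection, table: str) -> list:
    """Pull only rows changed since the last successful extraction."""
    watermark = read_watermark()
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE updated_at > ?", (watermark,)
    ).fetchall()
    new_mark = conn.execute(
        f"SELECT MAX(updated_at) FROM {table} WHERE updated_at > ?", (watermark,)
    ).fetchone()[0]
    # Advance the watermark only after the extracted rows have been safely landed.
    if new_mark is not None:
        with open(WATERMARK_FILE, "w") as f:
            f.write(str(new_mark))
    return rows
```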
Transformation/Processing
- What stages (logical partitions) do we have within the data lake? Clear partitions are needed to separate compute needs, security boundaries, and risk profiles.
- Is there personal or sensitive data that needs anonymization or masking to protect privacy and ensure regulatory compliance (see the pseudonymization sketch after this list)?
- Are there clear data cleansing requirements? Do we need data-profiling tools to scan for missing or inaccurate values and poorly structured fields, and to assess accuracy, completeness, consistency, timeliness, validity, and uniqueness (see the profiling sketch after this list)?
- Do we have a canonical data model to represent the main subject areas? What is the impedance mismatch between the source data model and the target canonical model?
- How many types of consumers are there for the single source of truth (canonical data)?
- What are the decision criteria for selecting the technology stack for processing data?
- Is there a dimensional model in place?
- Do we have an estimate of the volume of data to be processed? What performance considerations need to be addressed accordingly?
- Is there a design for handling slowly changing dimensions? The choice of SCD type drives that design (see the SCD Type 2 sketch after this list).
- Do we need to maintain a data catalog or schemas? Is there a mapping between final reports and tables?
- Are there governance requirements that need to be included?
- Reliability should be built into the pipelines: retries, idempotent writes, data-quality checks, and alerting on failure.
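For personal or sensitive fields, one common approach is deterministic pseudonymization: replace values with a keyed hash so records stay joinable while raw values never leave the secured zone. A minimal sketch; the field list and key handling are assumptions.

```python
import hashlib
import hmac

# The key should come from a secrets manager, never from source control.
PSEUDONYMIZATION_KEY = b"replace-with-managed-secret"   # hypothetical
PII_FIELDS = {"email", "phone", "national_id"}          # hypothetical field list

def pseudonymize(record: dict) -> dict:
    """Replace PII values with a keyed hash: deterministic (joins still work),
    but not reversible without the key."""
    masked = dict(record)
    for field in PII_FIELDS & record.keys():
        value = str(record[field]).encode("utf-8")
        masked[field] = hmac.new(PSEUDONYMIZATION_KEY, value, hashlib.sha256).hexdigest()
    return masked
```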
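A minimal profiling sketch using pandas, assuming the data fits in memory; dedicated profiling tools cover the same checks (completeness, uniqueness, validity) at scale.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic per-column quality metrics: data type, completeness, and uniqueness."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": df.isna().mean().round(3),               # completeness
        "distinct_pct": (df.nunique() / len(df)).round(3),   # uniqueness
    })

# Example validity rule (hypothetical column and pattern):
# invalid_emails = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
```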
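A minimal SCD Type 2 sketch in plain Python, assuming a dimension keyed by a single business key with `effective_from`, `effective_to`, and `is_current` columns (all assumptions): the current row is expired and a new current row appended whenever a tracked attribute changes.

```python
from datetime import date

def apply_scd2(dimension: list[dict], incoming: dict, business_key: str,
               today: date | None = None) -> list[dict]:
    """SCD Type 2: expire the current row for the business key and append a
    new current row whenever any tracked attribute has changed."""
    today = today or date.today()
    tracked = [k for k in incoming if k != business_key]
    updated = []
    changed_or_new = True
    for row in dimension:
        if row["is_current"] and row[business_key] == incoming[business_key]:
            if all(row.get(k) == incoming.get(k) for k in tracked):
                changed_or_new = False  # no change: keep the current row as-is
            else:
                row = {**row, "is_current": False, "effective_to": today}
        updated.append(row)
    if changed_or_new:
        updated.append({**incoming, "effective_from": today,
                        "effective_to": None, "is_current": True})
    return updated
```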
Serving
- How many reports do we have, and are the requirements clear?
- Is there a logical grouping of reports from a development perspective?
- Do we have a mapping between reports and data requirements?
- The serving data model should follow the rule of thumb of one query to one table (see the denormalization sketch after this list).
- Separate self-service/ad-hoc reporting from high-frequency operational reporting.
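A minimal illustration of the one-query-to-one-table rule: rather than joining at report time, pre-join facts and dimensions into a denormalized serving table so each report reads a single wide table. The table and column names here are hypothetical.

```python
import pandas as pd

def build_sales_serving_table(fact_sales: pd.DataFrame,
                              dim_product: pd.DataFrame,
                              dim_store: pd.DataFrame) -> pd.DataFrame:
    """Pre-join the fact with its dimensions so each report reads one wide
    table instead of issuing multi-table joins at query time."""
    return (
        fact_sales
        .merge(dim_product, on="product_id", how="left")
        .merge(dim_store, on="store_id", how="left")
    )
```

The trade-off is extra storage and refresh cost in exchange for simple, predictable report queries; high-frequency operational reports in particular benefit from not joining at read time.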