Data Engineering Project Initiation Checklist

Date:

Some upfront work is required to ensure the success of data engineering projects. I have used this checklist to provide a framework for collaborating with multiple stakeholders to define clear requirements and designs.

Acquisition

  • Are the data sources external to the cloud? If so, we need an ingestion design for compressed/encrypted data arriving from external sources, which adds complexity to the overall solution.
  • Regardless of the extraction pattern (delta or full load), it is recommended to use a data file landing stage on cloud storage for roll-back, archival (raw data), and audit purposes (a landing-path sketch follows this list).
  • Establish standard tools and SLAs with the extraction/source teams; these significantly impact timelines and the quality of initial deliveries.
  • Is data encrypted in-flight/at-rest? Is the security architecture built around acquiring the data, with a regular audit process to ensure compliance?
  • Establish the frequency of data extraction from sources to the system based on entity type, reporting needs, and source system constraints. This should be handled on a case-by-case basis as one size does not fit all.
  • Can delta loads be extracted from the source system? (A watermark-based sketch follows this list.)
  • When starting out, it’s best to extract only the data that is actually needed from source systems to keep scope manageable. Do we have a list of the tables that need extraction?
  • What format (type, compression, size, etc.) should the extracted data be in?
  • How much reliability needs to be built into the extraction process? (A retry sketch follows this list.)
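
As a minimal sketch of the landing stage mentioned above, assuming S3-style object storage and illustrative source/table names, raw extracts can be keyed by source, table, and load date so that roll-back, archival, and audits stay simple:

```python
from datetime import date

def landing_key(source: str, table: str, load_date: date, file_name: str) -> str:
    """Build an object-store key for the raw landing stage.

    Partitioning raw files by source, table, and load date keeps
    roll-back, archival, and audits straightforward.
    """
    return (
        f"landing/{source}/{table}/"
        f"load_date={load_date.isoformat()}/{file_name}"
    )

# e.g. landing/crm/customers/load_date=2024-05-01/customers_001.csv.gz
print(landing_key("crm", "customers", date(2024, 5, 1), "customers_001.csv.gz"))
```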
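
Where the source supports delta loads, one common approach is a high-water mark on a change-tracking column. A sketch, assuming a hypothetical updated_at column and an externally stored watermark:

```python
from datetime import datetime

def build_delta_query(table: str, watermark_column: str, last_watermark: datetime) -> str:
    """Incremental extract: only pull rows changed since the last
    successful run (the stored high-water mark)."""
    # Note: use parameterized queries in real code; string interpolation
    # is kept here only to make the pattern visible.
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{last_watermark.isoformat()}' "
        f"ORDER BY {watermark_column}"
    )

# After a successful load, persist max(updated_at) as the new watermark.
query = build_delta_query("crm.customers", "updated_at", datetime(2024, 5, 1))
```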
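
The reliability question above often translates into retries around the source calls. A minimal sketch of exponential backoff, assuming the wrapped extraction call (extract_table is a placeholder) is safe to re-run:

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay_s: float = 5.0):
    """Retry a flaky extraction call with exponential backoff.

    Assumes the wrapped call is idempotent; otherwise partial extracts
    must be cleaned up before retrying.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))

# Usage (extract_table stands in for the real source call):
# rows = with_retries(lambda: extract_table("crm.customers"))
```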

Transformation/Processing

  • What stages (logical partitions) do we have within the data lake? Clear partitions are needed to separate compute needs, security, and risk profiles.
  • Is there personal or sensitive data that needs anonymization or concealment to protect privacy and ensure regulatory compliance? (A pseudonymization sketch follows this list.)
  • Are there clear data cleansing requirements? Do we need data-profiling tools to scan for missing or inaccurate values and poorly structured fields, and to check accuracy, completeness, consistency, timeliness, validity, and uniqueness? (A rule-based check sketch follows this list.)
  • Do we have a canonical data model to represent main subject areas? What is the data model impedance between the data source model and the target canonical model?
  • How many types of consumers are there for the single source of truth (canonical data)?
  • What are the decision criteria for selecting the technology stack for processing data?
  • Is there a dimensional model in place?
  • Do we have an estimate of the volume of data to be processed? What performance considerations need to be addressed accordingly?
  • Is there a design to handle slowly changing dimensions? The requirements here determine which SCD type to implement (a Type 2 sketch follows this list).
  • Do we need to maintain a data catalog or schemas? Is there a mapping between final reports and tables?
  • Are there governance requirements that need to be included?
  • Reliability should be built into the pipelines: retries, idempotent re-runs, and monitoring/alerting (an idempotent-write sketch follows this list).
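
For the anonymization item above, one option is deterministic pseudonymization: hashing identifiers with a secret salt so records can still be joined without exposing raw values. The column names and salt handling below are assumptions:

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: bytes) -> str:
    """Replace a direct identifier with a keyed hash: deterministic, so the
    same input always maps to the same token and downstream joins still work."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"customer_id": "C-1001", "email": "jane@example.com", "country": "DE"}
salt = b"store-and-rotate-in-a-secrets-manager"  # assumption: managed secret
record["email"] = pseudonymize(record["email"], salt)
```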
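
Cleansing requirements can be made concrete as rule-based checks over the quality dimensions listed above. A minimal pandas sketch; the columns and rules are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C-1", "C-2", "C-2", None],
    "signup_date": ["2024-01-03", "2024-02-30", "2024-03-11", "2024-04-01"],
})

checks = {
    # Completeness: key columns must not be null.
    "customer_id_not_null": df["customer_id"].notna().all(),
    # Uniqueness: the business key must not repeat.
    "customer_id_unique": not df["customer_id"].dropna().duplicated().any(),
    # Validity: dates must parse; invalid values (e.g. 2024-02-30) fail.
    "signup_date_valid": pd.to_datetime(df["signup_date"], errors="coerce").notna().all(),
}

failed = [name for name, passed in checks.items() if not passed]
print("Failed checks:", failed)
```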
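
For slowly changing dimensions, the chosen SCD type drives the merge logic. A sketch of the Type 2 pattern with pandas (table and column names are assumptions): the current row is closed out and a new versioned row is appended.

```python
import pandas as pd

# Current dimension (SCD Type 2): one open row per business key.
dim = pd.DataFrame({
    "customer_id": ["C-1"],
    "country": ["DE"],
    "valid_from": ["2024-01-01"],
    "valid_to": [None],        # open-ended current row
    "is_current": [True],
})

# Incoming change from the source.
incoming = {"customer_id": "C-1", "country": "FR", "load_date": "2024-05-01"}

changed = (
    dim["is_current"]
    & (dim["customer_id"] == incoming["customer_id"])
    & (dim["country"] != incoming["country"])
)

# Close the old version, then append the new one.
dim.loc[changed, ["valid_to", "is_current"]] = [incoming["load_date"], False]
new_row = {
    "customer_id": incoming["customer_id"],
    "country": incoming["country"],
    "valid_from": incoming["load_date"],
    "valid_to": None,
    "is_current": True,
}
dim = pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)
```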
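
On pipeline reliability, one useful building block is idempotent output: each run owns a deterministic target keyed by run date, so a retry overwrites instead of duplicating. A sketch, assuming pandas with a parquet engine such as pyarrow and an illustrative path layout:

```python
import pandas as pd

def write_run_output(df: pd.DataFrame, base_path: str, run_date: str) -> str:
    """Idempotent write: each run owns exactly one output keyed by run_date,
    so a retried or re-run pipeline overwrites rather than duplicates."""
    target = f"{base_path}/run_date={run_date}.parquet"
    df.to_parquet(target, index=False)
    return target

# A retried run for 2024-05-01 writes to the same target and overwrites it.
# write_run_output(daily_orders, "curated/orders", "2024-05-01")
```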

Serving

  • How many reports do we have, and are the requirements clear?
  • Is there a logical grouping of reports from a development perspective?
  • Do we have a mapping between reports and data requirements?
  • The serving data model should follow the rule of thumb of one query to one table, i.e. each report reads a single, pre-joined table (see the sketch after this list).
  • Separate self-service/ad-hoc reporting from high-frequency operational reporting.
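
The "one query to one table" rule means each report reads a single pre-joined table rather than joining at query time. A small pandas sketch with illustrative orders/customers data:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2],
    "customer_id": ["C-1", "C-2"],
    "amount": [120.0, 80.0],
    "order_date": ["2024-05-01", "2024-05-02"],
})
customers = pd.DataFrame({
    "customer_id": ["C-1", "C-2"],
    "segment": ["SMB", "Enterprise"],
    "country": ["DE", "FR"],
})

# Pre-join once in the pipeline so the report reads a single wide table
# ("one query to one table") instead of joining at query time.
orders_report = orders.merge(customers, on="customer_id", how="left")
```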