Posts by Collection

portfolio

Integrated Data

Combining data from different sources and providing a unified view

What is usable data?

We need data to be of acceptable quality, readily available, and easily accessible.

Data Fabric simply explained

Published:

What? An enterprise-wide, consistent data management design. Why? To reduce the time to deliver data integration and interoperability. How? Through metadata.

publications

Collect: Pre-ingest vs Post-ingest Processing

Published in Processing, 2024

I do not tend to draw hard lines between applying processing logic directly on the source system before extracting the data and performing transformations post-ingestion in an analytics platform. Both approaches are valid, depending on factors such as data volume, complexity, real-time requirements, and system architecture. However, at modern data scale, most workloads require processing to be done post-ingestion.
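The trade-off can be sketched in a few lines. Below, an in-memory SQLite database stands in for the source system (the table and rows are invented for the demo): pre-ingest processing pushes the filter into the source query, while post-ingest processing extracts everything raw and transforms it in the analytics layer.

```python
import sqlite3

# Stand-in "source system": an in-memory SQLite table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "EU"), (2, 80.0, "US"), (3, 200.0, "EU")],
)

# Pre-ingest: push the logic into the source query, so only the
# rows of interest ever leave the source system.
pre = conn.execute(
    "SELECT id, amount FROM orders WHERE region = 'EU'"
).fetchall()

# Post-ingest: extract everything raw, then transform inside the
# analytics platform (plain Python here, in place of one).
raw = conn.execute("SELECT id, amount, region FROM orders").fetchall()
post = [(oid, amt) for oid, amt, region in raw if region == "EU"]

assert pre == post  # same result, produced at a different stage
```

Same output either way; what differs is where the compute runs and how much data crosses the wire.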

Collect: Data Profiling

Published in Processing, 2024

Data profiling is essential for understanding the quality, structure, and consistency of data.
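A minimal profiling pass can be done with the standard library alone. The sketch below (the `profile` helper and the sample rows are invented for illustration) computes the usual first-look statistics for one column: row count, null count, distinct values, and min/max.

```python
def profile(rows, column):
    """Profile one column of a list-of-dicts dataset (illustrative helper)."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),                        # total rows
        "nulls": len(values) - len(non_null),        # missing values
        "distinct": len(set(non_null)),              # cardinality
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

rows = [{"age": 34}, {"age": 29}, {"age": None}, {"age": 34}]
stats = profile(rows, "age")
```

Running this on real data quickly surfaces null-heavy columns and suspicious cardinalities before any pipeline work begins.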

Collect: Data Capture

Published in Processing, 2024

Capture data from a source system for processing in an analytics system.

Measure Data Architecture

Published in Governance, 2024

Consistency in what we measure and how we measure it across data domains: a method with an example scenario.

talks

Data Engineering Project Initiation Checklist

Published:

Some upfront work is required to ensure the success of data engineering projects. I have used this checklist to provide a framework for collaborating with multiple stakeholders to define clear requirements and designs.

Cloud Storage: Best practices

Published:

  • Bucket names: I am split on whether to use descriptive names that clearly signal the intent of a bucket and its files, given the security concerns that may arise from doing so. If there is a need to hide the intent of buckets from potential attackers, we would need to manage and enforce catalogs. However, I have seen the worst of both worlds: names that give away enough, with the buckets not cataloged at all. I recommend either a naming convention or rules to catalog bucket names, plus audits to ensure compliance.
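An audit like the one recommended above is easy to automate. The sketch below assumes a hypothetical `<env>-<domain>-<purpose>` convention (the pattern and the allowed environments are made up for the example; real S3 naming rules are broader) and reports buckets that violate it.

```python
import re

# Hypothetical convention: <env>-<domain>-<purpose>, all lowercase,
# e.g. "prod-sales-raw". Pattern and environments are illustrative
# assumptions, not a standard.
BUCKET_NAME = re.compile(r"^(dev|test|prod)-[a-z0-9]+-[a-z0-9]+$")

def audit(bucket_names):
    """Return the buckets that violate the naming convention."""
    return [name for name in bucket_names if not BUCKET_NAME.fullmatch(name)]

violations = audit(["prod-sales-raw", "MyBucket", "dev-hr-curated"])
```

Wired into a scheduled job against the real bucket listing, this turns the convention into something enforceable rather than aspirational.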

Parquet: Best practices demonstration

Published:

An often overlooked feature of Parquet is its support for interoperability, which is key for enterprise data platforms that serve different tools and systems, facilitating data exchange and integration. This is my take on Parquet best practices, and I have used python-pyarrow to demonstrate them.

teaching

MinIO Object Storage for Linux Locally (anywhere)

Lakehouse, Minio, 2024

When I play with new technologies, I like to run them on my machine locally. MinIO is a perfect local simulation of cloud storage: you can deploy it locally and interact with it like an S3 object store.

ETL Data Test and Validate

Data Warehouse, Snowflake, 2024

Setup

  1. Installed DBT Core locally. The install configuration can be verified with the command: dbt debug

  2. Installed DBT Utils

AWS Lake Formation

Data Lake, LakeFormation, 2025

AWS Lake Formation = Scaled Data Lake + Scaled Security Provisioning