Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

Pages

Data Retention Policies

Published in Policies, 2024

Establish clear policies for how long data should be retained based on regulatory requirements and business needs.

Solve technical problems, increase efficiency and productivity, and improve systems

Posts

Hello PyTorch

13 minute read

Published:

The objective is to take the very basic example of linear regression and build a model in PyTorch, demonstrating the PyTorch workflow and its fundamentals.
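The workflow the post describes can be sketched in a few lines of PyTorch; the toy data, learning rate, and epoch count below are illustrative assumptions, not taken from the post:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative data for the line y = 2x + 1
X = torch.linspace(0, 1, 50).unsqueeze(1)
y = 2 * X + 1

model = nn.Linear(1, 1)                 # one weight, one bias
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)         # forward pass
    loss.backward()                     # backpropagation
    optimizer.step()                    # parameter update

w, b = model.weight.item(), model.bias.item()
```

After training, `w` and `b` should be close to the true slope 2 and intercept 1.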

Machine Learning for daily tasks

9 minute read

Published:

I was tasked with planning demand for tickets for a complex Application Maintenance System that supports multiple companies. There was some historical data available, and it was invaluable. Using machine learning with sklearn, we were able to predict ticket volumes on a monthly basis with a very high degree of accuracy. This post walks through the use case.
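A minimal sketch of the kind of sklearn forecast described above; the synthetic monthly ticket counts and the linear model choice are assumptions for illustration, not the post's actual data or model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative history: 24 months of ticket counts with a linear trend plus noise
rng = np.random.default_rng(42)
months = np.arange(1, 25).reshape(-1, 1)
tickets = 100 + 5 * months.ravel() + rng.normal(0, 3, size=24)

# Fit on history, then forecast the next month's volume
model = LinearRegression().fit(months, tickets)
forecast = model.predict([[25]])[0]
```

With a clear trend in the history, even this simple model recovers the expected volume for month 25 (around 225 here).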

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

portfolio

Integrated Data

Combining data from different sources and providing a unified view

What is usable data

We need data to have an acceptable quality, readily available and easily accessible.

Data Fabric simply explained

Published:

What? An enterprise-wide, consistent data management design. Why? To reduce the time to deliver data integration and interoperability. How? Through metadata.

publications

Collect: Pre-ingest vs Post-ingest Processing

Published in Processing, 2024

I do not tend to draw hard lines between applying processing logic directly on the source system before extracting the data and performing transformations post-ingestion in an analytics platform. Both approaches are valid, depending on factors such as data volume, complexity, real-time requirements, and system architecture. However, at modern data scale, most needs require processing to be done post-ingestion.
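The post-ingest pattern can be sketched with plain Python: the raw extract is landed unchanged, and cleaning happens inside the analytics platform. The records and field names below are hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical raw extract, landed as-is (pre-ingest processing would
# have cleaned these values on the source system before extraction)
raw_rows = [
    {"order_id": "1", "amount": " 19.90 ", "ts": "2024-03-01T10:00:00"},
    {"order_id": "2", "amount": "5.00", "ts": "2024-03-01T11:30:00"},
]

def transform(row):
    """Post-ingestion transformation inside the analytics platform."""
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"].strip()),
        "ts": datetime.fromisoformat(row["ts"]).replace(tzinfo=timezone.utc),
    }

clean_rows = [transform(r) for r in raw_rows]
```

Keeping the raw rows untouched preserves the option to re-run or change the transformation later, which is one reason the post leans post-ingest at scale.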

Collect: Data Profiling

Published in Processing, 2024

Data profiling is essential for understanding the quality, structure, and consistency of data

Collect: Data Capture

Published in Processing, 2024

Capture data from the source system for processing in an analytics system

Measure Data Architecture

Published in Governance, 2024

Consistency in what we measure and how we measure data domains. A method with an example scenario

talks

Data Engineering Project Initiation Checklist

Published:

Some upfront work is required to ensure the success of data engineering projects. I have used this checklist to provide a framework for collaborating with multiple stakeholders to define clear requirements and designs.

Cloud Storage: Best practices

Published:

  • Bucket names: I am split on whether to use smart names that clearly convey the intent of a bucket and its files, given the security concerns that may arise by doing so. If there is a need to hide the intent of buckets from possible attackers, we would need to manage and enforce catalogs. However, I have seen the worst of both worlds, where the naming gives away enough and the buckets are still not cataloged. I would recommend either a naming convention or rules to catalog bucket names, with audits to ensure compliance.
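The audit suggested above can be as simple as a regex check over the bucket inventory. The convention below (`<env>-<domain>-<purpose>`, lowercase) is a hypothetical example, not a cloud-provider rule:

```python
import re

# Hypothetical naming convention: <env>-<domain>-<purpose>, lowercase alphanumerics
BUCKET_PATTERN = re.compile(r"^(dev|test|prod)-[a-z0-9]+-[a-z0-9]+$")

def audit_bucket_names(names):
    """Return the bucket names that violate the convention, for a compliance audit."""
    return [n for n in names if not BUCKET_PATTERN.fullmatch(n)]

violations = audit_bucket_names(["prod-sales-raw", "MyBucket", "dev-hr-exports"])
```

Running this periodically against the full bucket list turns the convention into an enforceable rule rather than a guideline.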

Parquet: Best practices demonstration

Published:

An often overlooked feature of Parquet is its support for interoperability, which is key for enterprise data platforms that serve different tools and systems, facilitating data exchange and integration. This is my take on Parquet best practices, demonstrated with python-pyarrow.

teaching

MinIO Object Storage for Linux Locally (anywhere)

Lakehouse, Minio, 2024

When I play with new technologies, I like to run them on my machine locally. MinIO is a perfect local simulation of cloud storage: you can deploy it on your own machine and interact with it like an S3 object store.

ETL Data Test and Validate

Data Warehouse, Snowflake, 2024

Setup

  1. Installed dbt Core locally. The install configuration can be verified with the command: dbt debug

  2. Installed dbt Utils
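The dbt Utils install in step 2 is typically declared in a `packages.yml` next to `dbt_project.yml` and pulled in with `dbt deps`; the version range below is illustrative:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]  # illustrative range, pin to your tested version
```

After editing the file, run `dbt deps` to download the package and `dbt debug` again to confirm the project still resolves.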

AWS Lake Formation

Data Lake, LakeFormation, 2025

AWS Lake Formation = Scaled Data Lake + Scaled Security Provisioning