Posts by Collection

portfolio

Integrated Data

Combining data from different sources and providing a unified view

What is usable data?

We need data to be of acceptable quality, readily available, and easily accessible.

Data Fabric simply explained

Published:

What? An enterprise-wide, consistent data management design. Why? To reduce the time to deliver data integration and interoperability. How? Through metadata.

publications

Collect: Pre-ingest vs Post-ingest Processing

Published in Processing, 2024

I do not tend to draw hard lines between applying processing logic directly on the source system before extracting the data and performing transformations post-ingestion in an analytics platform. Both approaches are valid, depending on factors such as data volume, complexity, real-time requirements, and system architecture. However, at modern data scale, most workloads require processing to be done post-ingestion.
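The trade-off can be sketched in a few lines. Below, an in-memory SQLite database stands in for the source system (the table and rows are invented for the demo): pre-ingest processing pushes the filter into the source query, while post-ingest processing extracts everything raw and transforms it in the analytics layer.

```python
import sqlite3

# Stand-in "source system": an in-memory SQLite table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "EU"), (2, 80.0, "US"), (3, 200.0, "EU")],
)

# Pre-ingest: push the logic into the source query, so only the
# rows of interest ever leave the source system.
pre = conn.execute(
    "SELECT id, amount FROM orders WHERE region = 'EU'"
).fetchall()

# Post-ingest: extract everything raw, then transform inside the
# analytics platform (plain Python here, in place of one).
raw = conn.execute("SELECT id, amount, region FROM orders").fetchall()
post = [(oid, amt) for oid, amt, region in raw if region == "EU"]

assert pre == post  # same result, produced at a different stage
```

Same output either way; what differs is where the compute runs and how much data crosses the wire.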

Collect: Data Profiling

Published in Processing, 2024

Data profiling is essential for understanding the quality, structure, and consistency of data.
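A minimal profiling pass can be done with the standard library alone. The sketch below (the `profile` helper and the sample rows are invented for illustration) computes the usual first-look statistics for one column: row count, null count, distinct values, and min/max.

```python
def profile(rows, column):
    """Profile one column of a list-of-dicts dataset (illustrative helper)."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),                        # total rows
        "nulls": len(values) - len(non_null),        # missing values
        "distinct": len(set(non_null)),              # cardinality
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

rows = [{"age": 34}, {"age": 29}, {"age": None}, {"age": 34}]
stats = profile(rows, "age")
```

Running this on real data quickly surfaces null-heavy columns and suspicious cardinalities before any pipeline work begins.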

Collect: Data Capture

Published in Processing, 2024

Capture data from a source system for processing in an analytics system.

Measure Data Architecture

Published in Governance, 2024

Consistency in what we measure and how we measure it across data domains: a method with an example scenario.

talks

Data Engineering Project Initiation Checklist

Published:

Some upfront work is required to ensure the success of data engineering projects. I have used this checklist to provide a framework for collaborating with multiple stakeholders to define clear requirements and designs.

Cloud Storage: Best practices

Published:

  • Bucket names: I am split on whether to use descriptive names that clearly signal the intent of a bucket and its files, given the security concerns that may arise from doing so. If there is a need to hide the intent of buckets from potential attackers, we would need to manage and enforce catalogs. However, I have seen the worst of both worlds: names that give away enough, with the buckets not cataloged at all. I recommend either a naming convention or rules to catalog bucket names, plus audits to ensure compliance.
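An audit like the one recommended above is easy to automate. The sketch below assumes a hypothetical `<env>-<domain>-<purpose>` convention (the pattern and the allowed environments are made up for the example; real S3 naming rules are broader) and reports buckets that violate it.

```python
import re

# Hypothetical convention: <env>-<domain>-<purpose>, all lowercase,
# e.g. "prod-sales-raw". Pattern and environments are illustrative
# assumptions, not a standard.
BUCKET_NAME = re.compile(r"^(dev|test|prod)-[a-z0-9]+-[a-z0-9]+$")

def audit(bucket_names):
    """Return the buckets that violate the naming convention."""
    return [name for name in bucket_names if not BUCKET_NAME.fullmatch(name)]

violations = audit(["prod-sales-raw", "MyBucket", "dev-hr-curated"])
```

Wired into a scheduled job against the real bucket listing, this turns the convention into something enforceable rather than aspirational.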

Parquet: Best practices demonstration

Published:

An often overlooked feature of Parquet is its support for interoperability, which is key for enterprise data platforms that serve different tools and systems, facilitating data exchange and integration. This is my take on Parquet best practices, and I have used python-pyarrow to demonstrate them.

teaching

MinIO Object Storage for Linux Locally (anywhere)

Lakehouse, Minio, 2024

When I play with new technologies, I like to run them on my machine locally. MinIO is a perfect local simulation of cloud storage: you can deploy it locally and interact with it like an S3 object store.

ETL Data Test and Validate

Data Warehouse, Snowflake, 2024

Setup

  1. Installed DBT Core locally. The install configuration can be verified with the command: dbt debug

  2. Installed DBT Utils

AWS Lake Formation

Data Lake, LakeFormation, 2025

AWS Lake Formation = Scaled Data Lake + Scaled Security Provisioning