Data Engineering

Knowledge base covering ETL/ELT, data pipelines, data warehousing, distributed computing, and the modern data stack.

Concepts and Architecture

Distributed Processing

Storage and Databases

  • hadoop hdfs - HDFS architecture, blocks, replication, small files problem
  • apache hive - SQL-on-Hadoop, Metastore, join strategies (MapJoin, SMB)
  • hbase - wide-column NoSQL, row key, column families, versioning
  • clickhouse - columnar OLAP, partitions, granules, primary key, functions
  • clickhouse engines - MergeTree family, compression, skip indexes
  • greenplum mpp - MPP architecture, distribution, motion operators
  • postgresql administration - transactions, MVCC, PL/pgSQL, query optimization
  • mongodb nosql - document store, CAP theorem, aggregation pipelines

Infrastructure and Tools

Cross-Cutting

  • mlops feature store - MLflow, feature stores, model serving, CRISP-DM
  • sql for de - window functions, CTEs, recursive queries, optimization
  • python for de - database access, Pandas, functional programming, testing
  • index - deep SQL reference
  • index - Python language fundamentals
  • index - CI/CD, infrastructure as code
  • index - system design patterns
  • index - ML and analytics
  • index - BI tools and dashboards