Data Engineering

Knowledge base covering ETL/ELT, data pipelines, data warehousing, distributed computing, and the modern data stack.

Concepts and Architecture

Distributed Processing

Storage and Databases

  • hadoop hdfs - HDFS architecture, blocks, replication, small files problem
  • apache hive - SQL-on-Hadoop, Metastore, join strategies (MapJoin, SMB)
  • hbase - wide-column NoSQL, row key, column families, versioning
  • clickhouse - columnar OLAP, partitions, granules, primary key, functions
  • clickhouse engines - MergeTree family, compression, skip indexes
  • greenplum mpp - MPP architecture, distribution, motion operators
  • postgresql administration - transactions, MVCC, PL/pgSQL, query optimization
  • mongodb nosql - document store, CAP theorem, aggregation pipelines

Infrastructure and Tools

Cross-Cutting

  • mlops feature store - MLflow, feature stores, model serving, CRISP-DM
  • sql for de - window functions, CTEs, recursive queries, optimization
  • python for de - database access, Pandas, functional programming, testing
  • [[sql-databases/index]] - deep SQL reference
  • [[python/index]] - Python language fundamentals
  • [[devops/index]] - CI/CD, infrastructure as code
  • [[architecture/index]] - system design patterns
  • [[data-science/index]] - ML and analytics
  • [[bi-analytics/index]] - BI tools and dashboards