Big Data

Introduction

In today's enterprise world, managing and analyzing vast amounts of data is crucial for gaining competitive insights and driving informed decision-making. This course will equip you with a fundamental understanding of big data concepts and the Hadoop framework, including HDFS, MapReduce, and YARN, tailored to enterprise applications. By the end of this course, you will possess the foundational skills needed to implement and leverage big data technologies in an enterprise environment, enhancing your organization's data processing capabilities.

Educational goals

  • Discover how to manage the spectacular growth of data in the enterprise.

  • Explore the different components of a Big Data cluster and how they interact.

  • Understand Big Data paradigms.

  • Understand the advantages of Open Source solutions.

  • Develop a Big Data project from scratch.

Prerequisites

SQL and Python programming, a good understanding of the Linux shell and Git.

Recommended previous courses include DevOps and Git.

Complementary courses include Spark, streaming, and MLOps.

Modules

Module 1 (3h) - Big Data introduction

  • Information Systems
  • Distributed systems
  • Horizontal vs vertical scaling
  • Data structure
  • History of data
  • The 3 Vs
  • Who needs Big Data?
  • Big Data clusters
  • The Hadoop Ecosystem
  • Data skills and profiles

Module 2 (3h) - Hadoop core: HDFS and YARN

  • Hadoop ecosystem introduction
  • Hadoop ecosystem projects
  • Hadoop core components
  • HDFS: presentation
  • HDFS: Master / Slave architecture
  • HDFS: File storage
  • HDFS: Data replication example
  • HDFS: Client interactions
  • HDFS: Important properties
  • HDFS: Single Master mode vs High Availability
  • YARN: presentation
  • YARN: Architecture
  • YARN: Applications
  • YARN: Application lifecycle
  • YARN: Job scheduler and resource management
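The HDFS file-storage and replication ideas above can be illustrated with a small sketch. This is plain Python, not an HDFS client; the 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here, and the round-robin placement is a simplification of HDFS's real rack-aware policy:

```python
# Sketch: how HDFS splits a file into fixed-size blocks and places replicas.
# Not a real HDFS client -- just illustrates the block math and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024   # common HDFS default: 128 MB blocks
REPLICATION = 3                  # common default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the HDFS blocks a file of `file_size` bytes needs."""
    n_full, remainder = divmod(file_size, block_size)
    return [block_size] * n_full + ([remainder] if remainder else [])

def place_replicas(n_blocks: int, datanodes: list, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin
    (real HDFS placement is rack-aware; this only shows the idea)."""
    placement = {}
    for block_id in range(n_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

one_gb = 1024 * 1024 * 1024
blocks = split_into_blocks(one_gb + 1)   # 1 GB + 1 byte
print(len(blocks))                        # 9 blocks: 8 full blocks + 1 one-byte block
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])[0])
```

Note that the last block only occupies its actual size on disk, which is why HDFS prefers a few large files over many small ones.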

Module 3 (3h) - Distributed processing and the MapReduce framework

  • HDFS + YARN architecture
  • MapReduce: a framework
  • MapReduce: Application steps
  • MapReduce: Word count example
  • MapReduce: Distribution on a cluster
  • MapReduce: Important properties
  • MapReduce vs other frameworks
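The word-count example above is the classic way to see the three MapReduce phases. Here is a minimal single-process sketch of the same flow (map, shuffle/group by key, reduce) in plain Python, with no cluster involved:

```python
# Sketch of the MapReduce word-count flow: map -> shuffle -> reduce.
# Runs in one process; a real framework distributes each phase across nodes.
from collections import defaultdict

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is distributed"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'distributed': 1}
```

The shuffle step is what makes distribution possible: every pair with the same key ends up on the same reducer, wherever the mapper that produced it ran.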

Module 4 (3h) - Data warehousing with Hive

  • OLAP vs OLTP
  • Hive introduction
  • Data querying on HDFS
  • Data file formats
  • Hive architecture and components
  • Example: daily ingestion of CSV file
  • Hive partitions
  • Bronze/silver/gold paradigm
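The daily-ingestion and partition topics above rest on one layout idea: Hive stores each partition as a subdirectory such as `dt=2024-01-01` under the table's directory on HDFS. The sketch below mimics that layout on the local filesystem; the table and column names are made up for illustration:

```python
# Sketch: the partition-directory layout Hive uses for daily CSV ingestion.
# Writes to a local temp directory instead of HDFS; names are illustrative.
import csv
import tempfile
from pathlib import Path

def ingest_daily_csv(warehouse: Path, table: str, day: str, rows: list):
    """Write one day's rows into a Hive-style partition directory (dt=<day>)."""
    partition_dir = warehouse / table / f"dt={day}"   # partition column: dt
    partition_dir.mkdir(parents=True, exist_ok=True)
    out_file = partition_dir / "part-00000.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return out_file

warehouse = Path(tempfile.mkdtemp())
ingest_daily_csv(warehouse, "sales_bronze", "2024-01-01",
                 [{"amount": "10", "customer": "a"}])
ingest_daily_csv(warehouse, "sales_bronze", "2024-01-02",
                 [{"amount": "7", "customer": "b"}])
partitions = sorted(p.name for p in (warehouse / "sales_bronze").iterdir())
print(partitions)   # ['dt=2024-01-01', 'dt=2024-01-02']
```

Because the partition value is encoded in the path, a query filtered on `dt` can skip whole directories (partition pruning) instead of scanning every file.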

Module 5 (3h) - NoSQL with HBase

  • NoSQL definition
  • Apache HBase
  • The CAP Theorem
  • HBase: introduction
  • HBase: data structure
  • HBase: data storage
  • HBase: architecture
  • HBase: data storage in RegionServers
  • HBase: partition tolerance and HA
  • HBase: querying
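The HBase data-structure topic above boils down to a sorted, versioned map: row key → column family:qualifier → timestamp → value. This in-memory toy (not the HBase client API) shows the shape of that model and why a default `get` returns the newest version:

```python
# Toy model of the HBase data structure:
# row key -> "family:qualifier" -> versions of (timestamp, value).
# Purely illustrative; real HBase persists this across RegionServers.
from collections import defaultdict

class ToyHBaseTable:
    def __init__(self):
        # row -> column -> list of (timestamp, value), kept newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, value: str, ts: int):
        """Store a new version of a cell; old versions are kept."""
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(reverse=True)   # newest timestamp first

    def get(self, row: str, column: str):
        """Return the most recent version, as an HBase get does by default."""
        cells = self._rows[row].get(column)
        return cells[0][1] if cells else None

table = ToyHBaseTable()
table.put("user#42", "info:city", "Paris", ts=1)
table.put("user#42", "info:city", "Lyon", ts=2)
print(table.get("user#42", "info:city"))   # Lyon -- the latest version wins
```

Designing the row key is the main modelling decision in HBase: rows are stored sorted by key, so the key determines both scan patterns and how data spreads over RegionServers.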

Module 6 (3h) - Stream processing with Kafka

  • Streaming definition
  • Streaming tools
  • Apache Kafka: Functionalities
  • Kafka: The Messaging System
  • Kafka: Topics
  • Kafka: Producers
  • Kafka: Consumers
  • Kafka: Data distribution
  • Apache Kafka performance
  • Additionnal usecases
  • Stream processing problematic
  • Stream processing: Dataflow model
  • Stream processing engines
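The Kafka data-distribution topic above comes down to one rule: a keyed record goes to `hash(key) % number_of_partitions`, so all records with the same key land on the same partition and keep their relative order. The sketch below mimics that idea with CRC32; Kafka's default partitioner actually uses murmur2, so this is an illustration, not Kafka's implementation:

```python
# Sketch of Kafka-style key-based partitioning: same key -> same partition.
# Uses CRC32 for illustration; Kafka's default partitioner uses murmur2.
import zlib

def pick_partition(key: bytes, n_partitions: int) -> int:
    """Deterministically map a record key to one of the topic's partitions."""
    return zlib.crc32(key) % n_partitions

N_PARTITIONS = 4
p1 = pick_partition(b"customer-42", N_PARTITIONS)
p2 = pick_partition(b"customer-42", N_PARTITIONS)
print(p1 == p2)   # True: same key, same partition, so per-key ordering holds
```

This is why Kafka only guarantees ordering within a partition, and why changing the number of partitions of a topic reshuffles where keys land.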

Module 7 (3h) - Job orchestration with Oozie

  • Oozie introduction
  • DAGs of jobs
  • Oozie workflow declaration
  • Alternative solutions
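An Oozie workflow is a DAG of actions: the server starts an action only after all of its predecessors have succeeded. A topological sort over a toy DAG (the job names are made up) shows one valid execution order; Oozie itself declares the DAG in XML rather than code:

```python
# Sketch: an Oozie workflow as a DAG of actions, ordered so that every
# action runs only after its dependencies. Job names are illustrative.
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# action -> set of actions it depends on
workflow = {
    "ingest_csv": set(),
    "clean_data": {"ingest_csv"},
    "load_hive": {"clean_data"},
    "build_report": {"load_hive"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)   # ['ingest_csv', 'clean_data', 'load_hive', 'build_report']
```

Independent branches of a real workflow can run in parallel; only the dependency edges constrain the schedule, which is exactly what the DAG encodes.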

Module 8 (3h) - Architecture and security in distributed systems

  • Hadoop cluster topology
  • Security
  • Linux identity
  • Identification with LDAP
  • Authentication with Kerberos
  • Authorization with Apache Ranger
  • Privacy: Encryption in Hadoop
  • Centralized gateway: Apache Knox
  • Governance

Module 9 (3h) - Cloud and alternative platforms

  • Types of cloud computing
  • On-premise vs Cloud
  • Kubernetes/Cloud native
  • Solutions and tools - IaaS
  • Solutions and tools - PaaS
  • Solutions and tools - ETL/Dataflow/Streaming
  • Solutions and tools - BI & Monitoring
  • Solutions and tools - ML Platforms
  • What’s next in Big Data? Data mesh?

Module 10 (3h) - Introduction to dataflow with NiFi

  • Intro to dataflow
  • Apache NiFi
  • NiFi Key Features
  • NiFi Architecture
  • NiFi Extended Ecosystem
  • NiFi Core Concepts
  • Considerations When Using NiFi
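The core dataflow concept above is a chain of processors connected by queues, each passing transformed data (FlowFiles, in NiFi's terms) to the next. A minimal in-memory pipeline sketches the idea; the processor names are illustrative and this is not NiFi's API:

```python
# Sketch of the dataflow idea behind NiFi: data moves through a chain of
# processors wired together by connections. Pure Python, illustrative only.
def generate():                      # akin to a source processor
    yield from ["  hello ", "WORLD ", ""]

def strip_whitespace(items):         # a simple transform processor
    for item in items:
        yield item.strip()

def drop_empty(items):               # a routing/filter processor
    for item in items:
        if item:
            yield item

# Wire the processors into a flow, as connections do on the NiFi canvas.
flow = drop_empty(strip_whitespace(generate()))
result = list(flow)
print(result)   # ['hello', 'WORLD']
```

In NiFi the connections between processors are buffered queues with back-pressure, so a slow downstream processor throttles its upstream instead of losing data.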

Module 11 (2h) - NiFi Basics: User Interface

  • Parts of the UI
  • NiFi components overview
  • Main configuration options for processors and connections

Practical project

Content:

  • Big Data architecture
  • Connect the components
  • Alternative solutions
  • The future of data