Big Data

Introduction

In today's enterprise world, managing and analyzing vast amounts of data is crucial for gaining competitive insights and driving informed decision-making. This course will equip you with a fundamental understanding of big data concepts and the Hadoop framework, including HDFS, MapReduce, and YARN, tailored to enterprise applications. By the end of this course, you will possess the foundational skills needed to implement and leverage big data technologies in an enterprise environment, enhancing your organization's data processing capabilities.

Educational goals

  • Discover how to manage the spectacular growth of data in the enterprise.

  • Explore the different components of a Big Data cluster and how they interact.

  • Understand Big Data paradigms.

  • Understand the advantages of Open Source solutions.

  • Develop a Big Data project from scratch.

Prerequisites

SQL and Python programming, a good understanding of the Linux shell and Git.

Recommended previous courses include DevOps and Git.

Complementary courses include Spark, streaming, and MLOps.

Modules

Module 1 (3h) - Big Data introduction

  • Information Systems
  • Distributed systems
  • Horizontal vs vertical scaling
  • Data structure
  • History of data
  • The 3 Vs
  • Who needs Big Data?
  • Big Data clusters
  • The Hadoop Ecosystem
  • Data skills and profiles

Module 2 (3h) - Hadoop core: HDFS and YARN

  • Hadoop ecosystem introduction
  • Hadoop ecosystem projects
  • Hadoop core components
  • HDFS: presentation
  • HDFS: Master / Slave architecture
  • HDFS: File storage
  • HDFS: Data replication example
  • HDFS: Client interactions
  • HDFS: Important properties
  • HDFS: Single Master mode vs High Availability
  • YARN: presentation
  • YARN: Architecture
  • YARN: Applications
  • YARN: Application lifecycle
  • YARN: Job scheduler and resource management
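The HDFS file-storage and replication ideas above can be illustrated with a small sketch. This is plain Python, not an HDFS client; the 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here, and the round-robin placement is a simplification of HDFS's real rack-aware policy:

```python
# Sketch: how HDFS splits a file into fixed-size blocks and places replicas.
# Not a real HDFS client -- just illustrates the block math and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024   # common HDFS default: 128 MB blocks
REPLICATION = 3                  # common default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the HDFS blocks a file of `file_size` bytes needs."""
    n_full, remainder = divmod(file_size, block_size)
    return [block_size] * n_full + ([remainder] if remainder else [])

def place_replicas(n_blocks: int, datanodes: list, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin
    (real HDFS placement is rack-aware; this only shows the idea)."""
    placement = {}
    for block_id in range(n_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

one_gb = 1024 * 1024 * 1024
blocks = split_into_blocks(one_gb + 1)   # 1 GB + 1 byte
print(len(blocks))                        # 9 blocks: 8 full blocks + 1 one-byte block
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])[0])
```

Note that the last block only occupies its actual size on disk, which is why HDFS prefers a few large files over many small ones.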

Module 3 (3h) - Distributed processing and the MapReduce framework

  • HDFS + YARN architecture
  • MapReduce: a framework
  • MapReduce: Application steps
  • MapReduce: Word count example
  • MapReduce: Distribution on a cluster
  • MapReduce: Important properties
  • MapReduce vs other frameworks
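The word-count example above is the classic way to see the three MapReduce phases. Here is a minimal single-process sketch of the same flow (map, shuffle/group by key, reduce) in plain Python, with no cluster involved:

```python
# Sketch of the MapReduce word-count flow: map -> shuffle -> reduce.
# Runs in one process; a real framework distributes each phase across nodes.
from collections import defaultdict

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is distributed"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'distributed': 1}
```

The shuffle step is what makes distribution possible: every pair with the same key ends up on the same reducer, wherever the mapper that produced it ran.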

Module 4 (3h) - Data warehousing with Hive

  • OLAP vs OLTP
  • Hive introduction
  • Data querying on HDFS
  • Data file formats
  • Hive architecture and components
  • Example: daily ingestion of CSV file
  • Hive partitions
  • Bronze/silver/gold paradigm
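The daily-ingestion and partition topics above rest on one layout idea: Hive stores each partition as a subdirectory such as `dt=2024-01-01` under the table's directory on HDFS. The sketch below mimics that layout on the local filesystem; the table and column names are made up for illustration:

```python
# Sketch: the partition-directory layout Hive uses for daily CSV ingestion.
# Writes to a local temp directory instead of HDFS; names are illustrative.
import csv
import tempfile
from pathlib import Path

def ingest_daily_csv(warehouse: Path, table: str, day: str, rows: list):
    """Write one day's rows into a Hive-style partition directory (dt=<day>)."""
    partition_dir = warehouse / table / f"dt={day}"   # partition column: dt
    partition_dir.mkdir(parents=True, exist_ok=True)
    out_file = partition_dir / "part-00000.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return out_file

warehouse = Path(tempfile.mkdtemp())
ingest_daily_csv(warehouse, "sales_bronze", "2024-01-01",
                 [{"amount": "10", "customer": "a"}])
ingest_daily_csv(warehouse, "sales_bronze", "2024-01-02",
                 [{"amount": "7", "customer": "b"}])
partitions = sorted(p.name for p in (warehouse / "sales_bronze").iterdir())
print(partitions)   # ['dt=2024-01-01', 'dt=2024-01-02']
```

Because the partition value is encoded in the path, a query filtered on `dt` can skip whole directories (partition pruning) instead of scanning every file.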

Module 5 (3h) - NoSQL with HBase

  • NoSQL definition
  • Apache HBase
  • The CAP Theorem
  • HBase: introduction
  • HBase: data structure
  • HBase: data storage
  • HBase: architecture
  • HBase: data storage in RegionServers
  • HBase: partition tolerance and HA
  • HBase: querying
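The HBase data-structure topic above boils down to a sorted, versioned map: row key → column family:qualifier → timestamp → value. This in-memory toy (not the HBase client API) shows the shape of that model and why a default `get` returns the newest version:

```python
# Toy model of the HBase data structure:
# row key -> "family:qualifier" -> versions of (timestamp, value).
# Purely illustrative; real HBase persists this across RegionServers.
from collections import defaultdict

class ToyHBaseTable:
    def __init__(self):
        # row -> column -> list of (timestamp, value), kept newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, value: str, ts: int):
        """Store a new version of a cell; old versions are kept."""
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(reverse=True)   # newest timestamp first

    def get(self, row: str, column: str):
        """Return the most recent version, as an HBase get does by default."""
        cells = self._rows[row].get(column)
        return cells[0][1] if cells else None

table = ToyHBaseTable()
table.put("user#42", "info:city", "Paris", ts=1)
table.put("user#42", "info:city", "Lyon", ts=2)
print(table.get("user#42", "info:city"))   # Lyon -- the latest version wins
```

Designing the row key is the main modelling decision in HBase: rows are stored sorted by key, so the key determines both scan patterns and how data spreads over RegionServers.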

Module 6 (3h) - Stream processing with Kafka

  • Streaming definition
  • Streaming tools
  • Apache Kafka: Functionalities
  • Kafka: The Messaging System
  • Kafka: Topics
  • Kafka: Producers
  • Kafka: Consumers
  • Kafka: Data distribution
  • Apache Kafka performance
  • Additionnal usecases
  • Stream processing problematic
  • Stream processing: Dataflow model
  • Stream processing engines
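The Kafka data-distribution topic above comes down to one rule: a keyed record goes to `hash(key) % number_of_partitions`, so all records with the same key land on the same partition and keep their relative order. The sketch below mimics that idea with CRC32; Kafka's default partitioner actually uses murmur2, so this is an illustration, not Kafka's implementation:

```python
# Sketch of Kafka-style key-based partitioning: same key -> same partition.
# Uses CRC32 for illustration; Kafka's default partitioner uses murmur2.
import zlib

def pick_partition(key: bytes, n_partitions: int) -> int:
    """Deterministically map a record key to one of the topic's partitions."""
    return zlib.crc32(key) % n_partitions

N_PARTITIONS = 4
p1 = pick_partition(b"customer-42", N_PARTITIONS)
p2 = pick_partition(b"customer-42", N_PARTITIONS)
print(p1 == p2)   # True: same key, same partition, so per-key ordering holds
```

This is why Kafka only guarantees ordering within a partition, and why changing the number of partitions of a topic reshuffles where keys land.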

Module 7 (3h) - Job orchestration with Oozie

  • Oozie introduction
  • DAGs of jobs
  • Oozie workflow declaration
  • Alternative solutions
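An Oozie workflow is a DAG of actions: the server starts an action only after all of its predecessors have succeeded. A topological sort over a toy DAG (the job names are made up) shows one valid execution order; Oozie itself declares the DAG in XML rather than code:

```python
# Sketch: an Oozie workflow as a DAG of actions, ordered so that every
# action runs only after its dependencies. Job names are illustrative.
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# action -> set of actions it depends on
workflow = {
    "ingest_csv": set(),
    "clean_data": {"ingest_csv"},
    "load_hive": {"clean_data"},
    "build_report": {"load_hive"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)   # ['ingest_csv', 'clean_data', 'load_hive', 'build_report']
```

Independent branches of a real workflow can run in parallel; only the dependency edges constrain the schedule, which is exactly what the DAG encodes.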

Module 8 (3h) - Architecture and security in distributed systems

  • Hadoop cluster topology
  • Security
  • Linux identity
  • Identification with LDAP
  • Authentication with Kerberos
  • Authorization with Apache Ranger
  • Privacy: Encryption in Hadoop
  • Centralized gateway: Apache Knox
  • Governance

Module 9 (3h) - Cloud and alternative platforms

  • Types of cloud computing
  • On-premise vs Cloud
  • Kubernetes/Cloud native
  • Solutions and tools - IaaS
  • Solutions and tools - PaaS
  • Solutions and tools - ETL/Dataflow/Streaming
  • Solutions and tools - BI & Monitoring
  • Solutions and tools - ML Platforms
  • What’s next in Big Data? Data mesh?

Module 10 (3h) - Introduction to dataflow with NiFi

  • Intro to dataflow
  • Apache NiFi
  • NiFi Key Features
  • NiFi Architecture
  • NiFi Extended Ecosystem
  • NiFi Core Concepts
  • Considerations When Using NiFi
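The core dataflow concept above is a chain of processors connected by queues, each passing transformed data (FlowFiles, in NiFi's terms) to the next. A minimal in-memory pipeline sketches the idea; the processor names are illustrative and this is not NiFi's API:

```python
# Sketch of the dataflow idea behind NiFi: data moves through a chain of
# processors wired together by connections. Pure Python, illustrative only.
def generate():                      # akin to a source processor
    yield from ["  hello ", "WORLD ", ""]

def strip_whitespace(items):         # a simple transform processor
    for item in items:
        yield item.strip()

def drop_empty(items):               # a routing/filter processor
    for item in items:
        if item:
            yield item

# Wire the processors into a flow, as connections do on the NiFi canvas.
flow = drop_empty(strip_whitespace(generate()))
result = list(flow)
print(result)   # ['hello', 'WORLD']
```

In NiFi the connections between processors are buffered queues with back-pressure, so a slow downstream processor throttles its upstream instead of losing data.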

Module 11 (2h) - NiFi Basics: User Interface

  • Parts of the UI
  • NiFi components overview
  • Main configuration options for processors and connections

Practical project

Content:

  • Big Data architecture
  • Connect the components
  • Alternative solutions
  • The future of data