In today's enterprise world, managing and analyzing vast amounts of data is crucial for gaining competitive insights and driving informed decision-making. This course will equip you with a fundamental understanding of big data concepts and the Hadoop framework, including HDFS, MapReduce, and YARN, tailored to enterprise applications. By the end of this course, you will possess the foundational skills needed to implement and leverage big data technologies in an enterprise environment, enhancing your organization's data processing capabilities.
- Discover how to manage the spectacular growth of data in the company.
- Explore the different components of a Big Data cluster and how they interact.
- Understand Big Data paradigms.
- Understand the advantages of Open Source solutions.
- Develop a Big Data project from scratch.
SQL and Python programming, plus a good understanding of the Linux shell and Git.
Recommended previous courses include DevOps and Git.
Complementary courses include Spark, streaming, and MLOps.
- Information Systems
- Distributed systems
- Horizontal vs vertical scaling
- Data structure
- History of data
- Distributed systems
- The 3 Vs
- Who needs Big Data?
- Big Data clusters
- The Hadoop Ecosystem
- Data skills and profiles
- Hadoop ecosystem introduction
- Hadoop ecosystem projects
- Hadoop core components
- HDFS: presentation
- HDFS: Master / Slave architecture
- HDFS: File storage
- HDFS: Data replication example
- HDFS: Client interactions
- HDFS: Important properties
- HDFS: Single Master mode vs High Availability
- YARN: presentation
- YARN: Architecture
- YARN: Applications
- YARN: Application lifecycle
- YARN: Job scheduler and resource management
- HDFS + YARN architecture
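The HDFS storage topics above can be illustrated with a minimal, self-contained sketch of block placement. This is a toy simulation, not the real HDFS placement policy (which is rack-aware); the node names are illustrative, and the block size and replication factor mirror common HDFS defaults.

```python
# Toy simulation of HDFS block placement: a file is split into fixed-size
# blocks, and each block is replicated on several DataNodes.
BLOCK_SIZE = 128   # MB, the common HDFS default
REPLICATION = 3    # common HDFS default replication factor

def place_blocks(file_size_mb, datanodes):
    """Return {block_id: [DataNodes holding a replica]}."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        # Simple round-robin placement; real HDFS also considers racks.
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement[b] = replicas
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_blocks(300, nodes))
# A 300 MB file becomes 3 blocks, each stored on 3 of the 4 DataNodes.
```

Losing one DataNode still leaves two replicas of every block, which is the property the "Data replication example" topic demonstrates.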
- MapReduce: a framework
- MapReduce: Application steps
- MapReduce: Word count example
- MapReduce: Distribution on a cluster
- MapReduce: Important properties
- MapReduce vs other frameworks
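The word-count example in the MapReduce topics above can be sketched as a single-process simulation of the three framework steps, map, shuffle (group by key), and reduce. The function names are illustrative; a real job would distribute each phase across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts emitted for each word.
    return key, sum(values)

lines = ["big data big cluster", "data data"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 3, 'cluster': 1}
```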
- OLAP vs OLTP
- Hive introduction
- Data querying on HDFS
- Data file formats
- Hive architecture and components
- Example: daily ingestion of CSV file
- Hive partitions
- Bronze/silver/gold paradigm
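The Hive partitioning and daily-ingestion topics above rest on one idea: Hive maps each partition to one HDFS directory. A minimal sketch of the resulting path layout, with an illustrative warehouse path and table name:

```python
from datetime import date

# Hypothetical warehouse root; the actual path depends on the cluster.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, ingestion_date):
    # Each daily CSV load lands in its own dt=YYYY-MM-DD directory,
    # so queries filtering on dt only scan the matching directories.
    return f"{WAREHOUSE}/{table}/dt={ingestion_date.isoformat()}"

print(partition_path("sales", date(2024, 1, 15)))
# /user/hive/warehouse/sales/dt=2024-01-15
```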
- NoSQL definition
- Apache HBase
- The CAP Theorem
- HBase: introduction
- HBase: data structure
- HBase: data storage
- HBase: architecture
- HBase: data storage in RegionServers
- HBase: partition tolerance and HA
- HBase: querying
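The HBase data-structure topics above can be modeled as a sorted map: row key → column family → qualifier → versioned cell. This is a toy in-memory model with illustrative names, keeping only the latest version per cell, not the real HBase storage engine.

```python
# Toy model of HBase's logical data structure.
table = {}

def put(row, family, qualifier, value, ts):
    # A cell is addressed by (row key, column family, qualifier)
    # and carries a timestamp; here a newer put overwrites the cell.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = (ts, value)

def get(row, family, qualifier):
    return table[row][family][qualifier]

put("user#42", "info", "name", "Ada", ts=1)
put("user#42", "info", "name", "Ada L.", ts=2)
print(get("user#42", "info", "name"))  # (2, 'Ada L.')

# Scans read rows in sorted row-key order, which is why
# row-key design drives query performance in HBase.
for row in sorted(table):
    print(row)
```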
- Streaming definition
- Streaming tools
- Apache Kafka: Functionalities
- Kafka: The Messaging System
- Kafka: Topics
- Kafka: Producers
- Kafka: Consumers
- Kafka: Data distribution
- Apache Kafka performance
- Additional use cases
- The stream processing problem
- Stream processing: Dataflow model
- Stream processing engines
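The Kafka data-distribution topic above boils down to one rule: messages with the same key always land on the same partition, chosen from a hash of the key. A toy sketch of that idea; real Kafka's default partitioner uses murmur2, and `zlib.crc32` merely stands in here.

```python
import zlib

NUM_PARTITIONS = 3  # illustrative topic configuration

def partition_for(key: bytes) -> int:
    # Hash the message key to pick a partition deterministically.
    return zlib.crc32(key) % NUM_PARTITIONS

events = [b"user-1", b"user-2", b"user-1", b"user-3"]
print([partition_for(k) for k in events])
# Same key -> same partition, so ordering per key is preserved.
```

This per-key routing is what lets consumers process each user's events in order while the topic as a whole scales across partitions.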
- Oozie introduction
- DAGs of jobs
- Oozie workflow declaration
- Alternative solutions
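The "DAGs of jobs" idea behind Oozie (and its alternatives) can be sketched without Oozie itself: declare each job's dependencies, then derive a valid execution order. The job names are illustrative; this uses Python's standard-library topological sorter rather than any Oozie API.

```python
from graphlib import TopologicalSorter

# Each job maps to the jobs it depends on, forming a DAG.
jobs = {
    "ingest": [],
    "clean": ["ingest"],
    "aggregate": ["clean"],
    "report": ["aggregate", "clean"],
}

order = list(TopologicalSorter(jobs).static_order())
print(order)
# A valid execution order: every job runs after all of its dependencies.
```

An Oozie workflow declares the same structure in XML, with the scheduler walking the graph and launching each action when its predecessors succeed.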
- Hadoop cluster topology
- Security
- Linux identity
- Identification with LDAP
- Authentication with Kerberos
- Authorization with Apache Ranger
- Privacy: Encryption in Hadoop
- Centralized gateway: Apache Knox
- Governance
- Types of cloud computing
- On-premise vs Cloud
- Kubernetes/Cloud native
- Solutions and tools - IaaS
- Solutions and tools - PaaS
- Solutions and tools - ETL/Dataflow/Streaming
- Solutions and tools - BI & Monitoring
- Solutions and tools - ML Platforms
- What’s next in Big Data? Data mesh?
- Intro to dataflow
- Apache NiFi
- NiFi Key Features
- NiFi Architecture
- NiFi Extended Ecosystem
- NiFi Core Concepts
- Considerations When Using NiFi
- Parts of the UI
- NiFi components overview
- Main configuration options for processors and connections
Content:
- Big Data architecture
- Connect the components
- Alternative solutions
- The future of data