1 Introducing Big Data & Hadoop
Learning Objective:
You will be introduced to real-world Big Data problems and learn how to solve them with current tools. You will understand how Hadoop's features address the limits of traditional data processing, learn the background of Hadoop, and compare the Hadoop distributions available in the market. You will also prepare the Unix box for the training.
1.1 Big Data Introduction
What is Big Data
Data Analytics
Big Data Challenges
Technologies that support Big Data
What is Hadoop?
History of Hadoop
Basic Concepts
Future of Hadoop
The Hadoop Distributed File System
Anatomy of a Hadoop Cluster
Breakthroughs of Hadoop
Hadoop Distributions:
Apache Hadoop
Cloudera Hadoop
Hortonworks Hadoop
MapR Hadoop
Hands On:
Install a virtual machine using VMware Player on the host machine, and work with some basic Unix commands needed for Hadoop.
2 Hadoop Daemon Processes
Learning Objective:
You will learn what the different Hadoop daemons are and what each one does at a high level.
Name Node
Data Node
Secondary Name Node
Job Tracker
Task Tracker
Hands On:
Create a Unix shell script to start all the daemons at once.
Start HDFS and MapReduce separately.
3 HDFS (Hadoop Distributed File System)
Learning Objective:
You will learn how to write and read files in HDFS, and understand how the Name Node, Data Node, and Secondary Name Node take part in the HDFS architecture. You will also learn the different ways of accessing HDFS data.
Blocks and Input Splits
Data Replication
Hadoop Rack Awareness
Cluster Architecture and Block Placement
Accessing HDFS
Java Approach
CLI Approach
Hands On:
Write a shell script that writes and reads files in HDFS. Change the replication factor at three levels. Use Java to work with HDFS.
Run different HDFS commands, including admin commands.
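As a rough illustration of the block and replication topics above, the following Python sketch estimates how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size and the replication factor of 3 are assumed defaults (common in Hadoop 2 and later; the values configured on a given cluster may differ), and the function names are hypothetical.

```python
import math

MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of HDFS blocks needed; the last block may be only partly filled."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)


def raw_storage_bytes(file_size_bytes, replication=3):
    """Raw bytes stored across the cluster; partial blocks are not padded."""
    return file_size_bytes * replication


if __name__ == "__main__":
    size = 300 * MB
    print(block_count(size))               # 3 blocks: 128 + 128 + 44 MB
    print(raw_storage_bytes(size) // MB)   # 900 MB of raw storage
```

Lowering the replication factor for a single file shrinks only that file's raw footprint, which is one reason the factor can be set at more than one level.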
4 Hadoop Installation Modes and HDFS
Learning Objective:
You will learn the different modes of Hadoop, set up pseudo-distributed mode from scratch, and work with its configuration. You will learn how the different HDFS operations work, with a visual walkthrough of HDFS read and write actions and the Name Node and Data Node daemons involved.
Local Mode
Pseudo-distributed Mode
Fully distributed mode
Pseudo Mode installation and configurations
HDFS basic file operations
Hands On:
Install VirtualBox Manager and install Hadoop in pseudo-distributed mode. Edit the configuration files required for pseudo-distributed mode. Perform different file operations on HDFS.
5 Hadoop Developer Tasks
Learning Objective:
Understand the phases of MapReduce, including the Map, Shuffle, Sort, and Reduce phases. Get a deep understanding of the life cycle of an MR job submitted to YARN. Learn about the Distributed Cache concept in detail with examples.
Write a Wordcount MR program and monitor the job using the Job Tracker and the YARN console. Also learn about more use cases.
Basic API Concepts
The Driver Class
The Mapper Class
The Reducer Class
The Combiner Class
The Partitioner Class
Examining a Sample MapReduce Program with several examples
Hadoop’s Streaming API
Hands On:
Write an MR job from scratch, implement different logic in the Mapper and Reducer, and submit the MR job in standalone and distributed modes.
Also write a Word Count MR job, calculate the average salary of employees who meet certain conditions, and calculate sales using MR.
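The Map, Shuffle/Sort, and Reduce phases named in this module can be sketched in plain Python before moving to the Hadoop API. This is a single-process illustration of the data flow only, not a Hadoop program; all function names are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle/sort phase: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
    # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

On a real cluster the mapper and reducer run on different machines and the framework performs the shuffle; the shape of the three functions stays the same.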
6 Hadoop Ecosystems
6.1 PIG
Learning Objective:
Understand the importance of Pig in the Big Data world, the PIG architecture, and the PIG Latin commands for performing complex operations on relations. Learn about Pig UDFs and aggregation functions with the Piggy Bank library, and how to pass dynamic arguments to Pig scripts.
PIG concepts
Install and configure PIG on a cluster
PIG Vs MapReduce and SQL
Write sample PIG Latin scripts
Modes of running PIG
Hands On:
Log in to the Pig Grunt shell to issue Pig Latin commands in different execution modes. Try different ways of loading and lazily transforming Pig relations. Register a UDF in the Grunt shell and perform replicated join operations.
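The idea behind a replicated join can be sketched in plain Python: the small relation is held in memory as a lookup table and the large relation is streamed past it, so no shuffle is needed. This is an illustration of the concept under that assumption, not Pig's implementation; the relation contents and function name are made up.

```python
def replicated_join(large_rows, small_rows, key):
    """Join two lists of dicts on `key`, keeping the small side in memory."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large relation; each row probes the in-memory table.
    for big in large_rows:
        for small in lookup.get(big[key], []):
            yield {**big, **small}


if __name__ == "__main__":
    orders = [{"dept": 10, "amount": 250}, {"dept": 20, "amount": 75}]
    depts = [{"dept": 10, "name": "sales"}]
    print(list(replicated_join(orders, depts, "dept")))
    # [{'dept': 10, 'amount': 250, 'name': 'sales'}]
```

The approach only pays off when the small relation fits comfortably in memory on every task.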
6.2 HIVE
Learning Objective:
Understand the importance of Hive in the Big Data world and the different ways of configuring the HIVE Metastore. Learn the different types of tables in Hive, how to optimize Hive jobs using partitioning and bucketing, and how to pass dynamic arguments to Hive scripts. You will also get an understanding of joins, UDFs, views, and more.
Hive concepts
Hive architecture
Installing and configuring HIVE
Managed tables and external tables
Joins in HIVE
Multiple ways of inserting data in HIVE tables
CTAS, views, alter tables
User defined functions in HIVE
Hive UDF
Hands On:
Execute Hive queries in different modes. Create internal and external tables. Optimize queries by creating tables with partitioning and bucketing. Run system-defined and user-defined functions, including explode and window functions.
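Partitioning and bucketing can be pictured as a two-step routing of each row: the partition column picks a directory, and a hash of the bucket column picks a file within it. The Python sketch below illustrates that routing only. The hash (zlib.crc32) is a stand-in chosen for repeatability and is not the function Hive uses, and the path layout and names are hypothetical.

```python
import zlib


def bucket_for(key, num_buckets):
    """Pick a bucket as hash(key) mod num_buckets (stand-in hash, not Hive's)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def storage_path(row, partition_col, bucket_col, num_buckets):
    """Build a hypothetical path: one directory per partition value,
    one file per bucket inside it."""
    partition_dir = f"{partition_col}={row[partition_col]}"
    bucket = bucket_for(row[bucket_col], num_buckets)
    return f"{partition_dir}/bucket_{bucket:05d}"


if __name__ == "__main__":
    row = {"country": "IN", "emp_id": 42}
    print(storage_path(row, "country", "emp_id", num_buckets=4))
```

A query that filters on the partition column can skip every other directory, and rows with the same bucket key always land in the same bucket, which is what makes bucketed joins and sampling possible.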
6.3 SQOOP
Learning Objective:
Learn how to import data from an RDBMS into HDFS and HIVE tables, both in full and incrementally, and how to export data from HDFS and HIVE tables to an RDBMS. Learn the architecture of Sqoop import and export.
SQOOP concepts
SQOOP architecture
Install and configure SQOOP
Connecting to RDBMS
Internal mechanism of import/export
Import data from Oracle/MySQL to HIVE
Export data to Oracle/MySQL
Other SQOOP commands.
Hands On:
Trigger a shell script that calls Sqoop import and export commands. Learn to automate Sqoop incremental imports by supplying the last value of the appended column. Run a Sqoop export from a HIVE table directly to an RDBMS.
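Sqoop's append-mode incremental import (the --incremental append, --check-column, and --last-value options) picks up only rows whose check column exceeds the last recorded value. The Python sketch below mimics that bookkeeping with an in-memory list standing in for the source table; it does not call Sqoop or a database, and the function and column names are hypothetical.

```python
def incremental_import(source_rows, check_column, last_value):
    """Return (new_rows, new_last_value).

    Only rows whose check column is greater than last_value are picked up,
    and the returned last value is what the next run should start from.
    """
    new_rows = [r for r in source_rows if r[check_column] > last_value]
    if new_rows:
        last_value = max(r[check_column] for r in new_rows)
    return new_rows, last_value


if __name__ == "__main__":
    table = [{"id": 1}, {"id": 2}, {"id": 3}]
    rows, last = incremental_import(table, "id", 0)     # first run: all rows
    table.append({"id": 4})
    rows, last = incremental_import(table, "id", last)  # second run: only id 4
    print(rows, last)  # [{'id': 4}] 4
```

Automating the import amounts to persisting the returned last value between runs so that nobody has to enter it by hand.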
6.4 HBASE
Learning Objective:
Understand the different types of NoSQL databases and the CAP theorem. Learn the DDL and CRUD operations of HBASE. Understand the HBase architecture and the importance of ZooKeeper in managing HBase. Learn HBase column family optimization and client-side buffering.
HBASE concepts
ZOOKEEPER concepts
HBASE and Region server architecture
File storage architecture
Defining Schema and basic operations
HBASE use cases
Hands On:
Create HBASE tables using the shell and perform CRUD operations with the Java API. Change the column family properties and perform the sharding process. Also create tables with multiple splits to improve HBASE query performance.
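Creating a table with multiple splits divides the row-key space at chosen split points, so that reads and writes spread across several regions. The Python sketch below shows how a row key maps to a region for a given set of split points. The split points and row keys are made up, and plain string comparison is used as a stand-in for HBase's byte-wise key ordering (the two agree for ASCII keys).

```python
from bisect import bisect_right


def region_index(row_key, split_points):
    """Index of the region holding row_key.

    split_points must be sorted; N split points define N + 1 regions.
    A region covers [start_key, end_key), so a key equal to a split
    point belongs to the region that starts at that point.
    """
    return bisect_right(split_points, row_key)


if __name__ == "__main__":
    splits = ["g", "n", "t"]               # 3 split points, 4 regions
    for key in ["apple", "g", "mango", "zebra"]:
        print(key, region_index(key, splits))
    # apple 0, g 1, mango 1, zebra 3
```

If most row keys share a prefix, they all fall into one region regardless of how many splits exist, which is why split points should follow the actual key distribution.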
6.5 OOZIE
Learning Objective:
Understand the Oozie architecture and monitor workflows using Oozie. Understand how Coordinators and Bundles work along with Workflows in Oozie. Also learn the Oozie commands to submit, monitor, and kill a workflow.
OOZIE concepts
OOZIE architecture
Workflow engine
Job coordinator
Installing and configuring OOZIE
hPDL and XML for creating Workflows
Nodes in OOZIE
Action nodes and Control nodes
Accessing OOZIE jobs through CLI, and web console
Develop and run sample workflows in OOZIE
Run MapReduce programs
Run HIVE scripts/jobs.
Hands On:
Create a workflow for Sqoop incremental imports. Create workflows for Pig, Hive, and Sqoop exports. Also execute a Coordinator to schedule the workflows.
6.6 FLUME
Learning Objective:
Understand the Flume architecture and its components: Source, Channel, and Sink. Configure Flume with socket and file sources and with HDFS and HBase sinks. Understand the fan-in and fan-out architectures.
FLUME Concepts
FLUME Architecture
Installation and configurations
Executing FLUME jobs
Hands On:
Create Flume configuration files and configure them with different sources and sinks. Stream Twitter data and create a Hive table.
7 Data Analytics using Pentaho as an ETL tool
Learning Objective:
You will learn Pentaho's Big Data best practices, guidelines, and techniques.
Data Analytics using Pentaho as an ETL tool
Big Data Integration with Zero Coding Required
Hands On:
You will use Pentaho as an ETL tool for data analytics.
8 Integrations
Learning Objective:
You will see the different integrations among Hadoop ecosystem components in a data engineering flow, and understand why it is important to create a flow for the ETL process.
MapReduce and HIVE integration
MapReduce and HBASE integration
Java and HIVE integration
HIVE and HBASE integration
Hands On:
Use storage handlers to integrate HIVE and HBASE. Integrate HIVE and PIG as well.