INFOSOFT IT SOLUTIONS - Hadoop Administration

Hadoop Administration Training

Introduction to Hadoop

Gain an overview of Hadoop, the open-source framework for distributed storage and processing of large datasets. Learn about its architecture, key components, and how it fits into the ecosystem of big data technologies.

Setting Up Hadoop

Learn how to install and configure a Hadoop cluster. Understand the requirements for hardware and software, and get hands-on experience with setting up Hadoop on both single-node and multi-node clusters.

Hadoop Distributed File System (HDFS)

Dive into the Hadoop Distributed File System (HDFS). Learn about its architecture, how it stores and manages data across a distributed environment, and how to perform operations such as file management and replication.
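As a rough illustration of how HDFS splits and replicates a file, the sketch below estimates how many blocks a file occupies and how much raw disk space it consumes across the cluster. It assumes the common defaults of a 128 MB block size (dfs.blocksize) and a replication factor of 3 (dfs.replication); the function name hdfs_block_footprint is hypothetical and not part of any Hadoop API.

```python
def hdfs_block_footprint(file_size_bytes,
                         block_size_bytes=128 * 1024 * 1024,  # assumed dfs.blocksize
                         replication=3):                      # assumed dfs.replication
    """Return (number_of_blocks, raw_bytes_stored_across_cluster)."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # The last block only uses the space it needs, so raw usage is the
    # file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    blocks, raw = hdfs_block_footprint(one_gib)
    print(f"1 GiB file: {blocks} blocks, {raw / one_gib:.0f} GiB of raw storage")
```

A real cluster may override both settings per file, so treat the numbers as a planning estimate rather than what the NameNode will report.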

YARN (Yet Another Resource Negotiator)

Explore YARN, the resource management layer of Hadoop. Understand how YARN manages resources, schedules tasks, and monitors job execution across the cluster.

Hadoop MapReduce

Learn about Hadoop MapReduce, the programming model for processing large datasets. Understand the MapReduce framework, its components, and how to write, execute, and optimize MapReduce jobs.
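To make the programming model concrete, here is a minimal single-process sketch of the classic word count in plain Python. It only imitates the map, shuffle, and reduce phases in memory; it does not use Hadoop, and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions rather than Hadoop APIs.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real job the map and reduce functions run on different nodes and the framework performs the shuffle over the network; the structure of the computation is the same.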

Hadoop Ecosystem Components

Discover other important components of the Hadoop ecosystem, such as Hive, Pig, HBase, and ZooKeeper. Learn how these tools integrate with Hadoop to provide additional functionality for data processing and analysis.

Cluster Management and Monitoring

Learn how to manage and monitor a Hadoop cluster. Explore tools and techniques for cluster health monitoring, performance tuning, and troubleshooting common issues.

Data Security and Access Control

Understand the best practices for securing Hadoop clusters. Learn about data encryption, user authentication, and access control mechanisms to protect sensitive data and ensure compliance.

Backup and Recovery

Discover strategies for backing up and recovering data in a Hadoop environment. Learn about backup tools, data replication, and disaster recovery planning to ensure data integrity and availability.

Performance Optimization

Learn techniques for optimizing Hadoop performance. Understand how to configure Hadoop settings, tune job performance, and leverage hardware resources effectively to improve overall system efficiency.

Hands-On Labs and Projects

Engage in hands-on labs and projects to apply your Hadoop administration skills. Work on real-world scenarios to develop practical experience in managing and optimizing Hadoop clusters.

Hadoop Administration Syllabus

1. Introduction to Hadoop Administration

High Availability
Scaling
Advantages and Challenges

2. Introduction to Big Data

What is Big Data
Big Data Opportunities and Challenges
Characteristics of Big Data

3. Introduction to Hadoop

Hadoop Distributed File System
Comparing Hadoop & SQL
Industries Using Hadoop
Data Locality
Hadoop Architecture
MapReduce & HDFS
Using the Hadoop Single Node Image (Clone)

4. Hadoop Distributed File System (HDFS)

HDFS Design & Concepts
Blocks, NameNodes, and DataNodes
HDFS High-Availability and HDFS Federation
HDFS Command-Line Interface
Basic File System Operations
Anatomy of File Read and File Write
Block Placement Policy and Modes
Configuration Files
Metadata, FsImage, Edit Log, Secondary NameNode, and Safe Mode
Adding and Decommissioning DataNodes Dynamically
FSCK Utility (Block Report)
Overriding Default Configuration
ZooKeeper Leader Election Algorithm
Exercise and Small Use Case on HDFS

5. MapReduce

MapReduce Functional Programming Basics
Map and Reduce Basics
How MapReduce Works
Anatomy of a MapReduce Job Run
Legacy Architecture (Job Submission, Initialization, Task Assignment, Execution, Progress and Status Updates)
Job Completion and Failures
Shuffling and Sorting
Splits, Record Reader, Partition, Types of Partitions & Combiner
Optimization Techniques (Speculative Execution, JVM Reuse)
Types of Schedulers and Counters
Comparisons Between Old and New API
Getting Data from RDBMS into HDFS Using Custom Data Types
Distributed Cache and Streaming (Python, Ruby, R)
YARN
Sequence Files and Map Files
Enabling Compression Codecs
Map-Side Join with Distributed Cache
Types of I/O Formats (Multiple Outputs, NLineInputFormat)
Handling Small Files Using CombineFileInputFormat

6. MapReduce Programming – Java Programming

Hands-on "Word Count" in MapReduce (Standalone and Pseudo-Distribution Mode)
Sorting Files Using the Hadoop Configuration API
Emulating "grep" for Searching Inside a File
DBInputFormat
Job Dependency API
Input Format API, Split API
Custom Data Type Creation in Hadoop

7. NoSQL

ACID in RDBMS and BASE in NoSQL
CAP Theorem and Types of Consistency
Types of NoSQL Databases in Detail
Columnar Databases (HBase and Cassandra)
TTL, Bloom Filters, and Compaction
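
The Bloom filter topic in the list above can be illustrated with a small sketch. Stores such as HBase and Cassandra use Bloom filters to skip on-disk files that definitely do not contain a key. The class below is a teaching sketch only: the sizes, the SHA-256 based hashing, and the names BloomFilter, add, and might_contain are assumptions for illustration and do not reflect either database's actual implementation.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        """Derive num_hashes bit positions by salting one hash function."""
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("row-42")
    print(bf.might_contain("row-42"))  # always True for an added item
```

The useful property is the asymmetry: a negative answer lets a read skip a file entirely, while a positive answer still requires checking the file.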

8. HBase

HBase Installation and Concepts
HBase Data Model and Comparison with RDBMS and NoSQL
Master & Region Servers
HBase Operations (DDL and DML) through Shell and Programming
Catalog Tables
Block Cache and Sharding
SPLITS
Data Modeling (Sequential, Salted, Promoted, Random Keys)
Java APIs and REST Interface
Client-Side Buffering and Processing 1 Million Records
HBase Counters
Enabling Replication and HBase RAW Scans
HBase Filters
Bulk Loading and Co-Processors (Endpoints and Observers)
Real-World Use Case (HDFS, MapReduce, HBase)
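
As a sketch of the salted-key idea from the data modeling topic above: sequential row keys (for example, timestamps) send all writes to one region, and a salt prefix spreads them across several. The example below derives the salt from a hash of the key so that a reader can recompute it. The function name salted_row_key, the bucket count of 16, and the key layout are illustrative assumptions, not an HBase API.

```python
import hashlib


def salted_row_key(row_key, num_buckets=16):
    """Prefix a row key with a deterministic bucket number.

    The bucket comes from a hash of the key itself, so the same input
    always maps to the same salted key and point lookups stay possible.
    """
    if not 1 <= num_buckets <= 100:
        raise ValueError("num_buckets must be between 1 and 100")
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}-{row_key}"


if __name__ == "__main__":
    for key in ("20240101-event1", "20240101-event2", "20240101-event3"):
        print(salted_row_key(key))
```

The trade-off is that a range scan over the original key order now has to fan out over every bucket prefix.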

9. Hive

Hive Installation, Introduction, and Architecture
Hive Services, Hive Shell, Hive Server, and Hive Web Interface (HWI)
Metastore, HiveQL
OLTP vs. OLAP
Working with Tables
Primitive and Complex Data Types
Working with Partitions
User Defined Functions
Hive Bucketed Tables and Sampling
External Partitioned Tables, Mapping Data to Partitions
Dynamic Partition
ORDER BY, DISTRIBUTE BY, and SORT BY Differences
Bucketing and Sorted Bucketing with Dynamic Partition
RCFile
Indexes and Views
Map-Side Joins
Compression on Hive Tables and Migrating Hive Tables
Dynamic Substitution of Hive and Running Hive
Enabling Updates in Hive
Log Analysis on Hive
Access HBase Tables Using Hive
Hands-on Exercises

10. Pig

Pig Installation
Execution Types
Grunt Shell
Pig Latin
Data Processing
Schema on Read
Primitive and Complex Data Types
Tuple Schema, BAG Schema, MAP Schema
Loading and Storing
Filtering, Grouping, and Joining
Debugging Commands
Validations and Type Casting
Working with Functions
User Defined Functions
Types of Joins in Pig and Replicated Join
SPLITS and Multi-Query Execution
Error Handling, FLATTEN and ORDER BY
Parameter Substitution
Nested FOREACH
User Defined Functions, Dynamic Invokers, and Macros
Accessing HBase Using Pig, Load and Write JSON Data
Piggy Bank
Hands-on Exercises

11. Sqoop

Sqoop Installation
Import Data (Full Table, Subset, Target Directory, etc.)
Incremental Import (New Data, Last Imported Data)
Free Form Query Import
Export Data to RDBMS, Hive, and HBase
Hands-on Exercises

12. HCatalog

HCatalog Installation
Introduction to HCatalog
HCatalog with Pig, Hive, and MapReduce
Hands-on Exercises

13. Flume

Flume Installation
Introduction to Flume
Flume Agents: Sources, Channels, and Sinks
Log User Information Using Java Program into HDFS
Log User Information Using Java Program into HBase
Flume Commands
Use Case: Flume Data from Twitter to HDFS and HBase

14. More Ecosystems

HUE (Hortonworks and Cloudera)

15. Oozie

Workflow (Action, Start, End, Kill, Join, Fork)
Schedulers, Coordinators, Bundles
Scheduling Sqoop Jobs, Hive, MapReduce, Pig
Real-World Use Case: Top Websites by User Age

16. ZooKeeper

HBase Integration with Hive and Pig
Phoenix
Proof of Concept (POC)

17. Spark

Spark Overview
Linking with Spark, Initializing Spark
Using the Shell
Resilient Distributed Datasets (RDDs)
Parallelized Collections
External Datasets
RDD Operations
Basics, Passing Functions to Spark
Working with Key-Value Pairs
Transformations
Actions
RDD Persistence
Choosing Storage Level
Removing Data
Shared Variables
Broadcast Variables
Accumulators
Deploying to a Cluster
Unit Testing
Migrating from Pre-1.0 Versions of Spark
Where to Go from Here

Training

Basic Level Training

Advanced Level Training

Project Level Training

Total Training Period

Course Mode :

Course Fees :

Placement Benefit Services

Provide 100% job-oriented training

Develop multiple skill sets

Assist in project completion

Build ATS-friendly resumes

Add relevant experience to profiles

Build and enhance online profiles

Supply manpower to consultants

Supply manpower to companies

Prepare candidates for interviews

Add candidates to job groups

Send candidates to interviews

Provide job references

Assign candidates to contract jobs

Select candidates for internal projects

Note

100% Job Assurance Only

Daily online batches for employees

New course batches start every Monday