Hadoop Administration Training

Introduction to Hadoop

Gain an overview of Hadoop, the open-source framework for distributed storage and processing of large datasets. Learn about its architecture, key components, and how it fits into the ecosystem of big data technologies.

Setting Up Hadoop

Learn how to install and configure a Hadoop cluster. Understand the requirements for hardware and software, and get hands-on experience with setting up Hadoop on both single-node and multi-node clusters.

Hadoop Distributed File System (HDFS)

Dive into the Hadoop Distributed File System (HDFS). Learn about its architecture, how it stores and manages data across a distributed environment, and how to perform operations such as file management and replication.
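As a rough illustration of how HDFS splits and replicates a file, the Python sketch below estimates how many blocks and block replicas a single file produces. It assumes the stock defaults of a 128 MB block size (dfs.blocksize) and a replication factor of 3 (dfs.replication); a real cluster may override both. The function name hdfs_footprint is hypothetical and is not part of any Hadoop API.

```python
import math

# Assumed defaults; check hdfs-site.xml on a real cluster.
BLOCK_SIZE_MB = 128   # dfs.blocksize default in Hadoop 2.x and later
REPLICATION = 3       # dfs.replication default


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    The last block of a file only occupies its actual size on disk,
    so raw storage is the file size times the replication factor.
    """
    if file_size_mb <= 0:
        return 0, 0, 0
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

Under these assumptions, a 1000 MB file is stored as 8 blocks, 24 block replicas, and about 3000 MB of raw disk across the cluster.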

YARN (Yet Another Resource Negotiator)

Explore YARN, the resource management layer of Hadoop. Understand how YARN manages resources, schedules tasks, and monitors job execution across the cluster.

Hadoop MapReduce

Learn about Hadoop MapReduce, the programming model for processing large datasets. Understand the MapReduce framework, its components, and how to write, execute, and optimize MapReduce jobs.
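The map, shuffle, and reduce stages can be illustrated without a cluster. The sketch below imitates the classic word count in plain Python. It is a teaching aid only: all function names are hypothetical, everything runs in one process, and real jobs are usually written against the Hadoop Java API or run through Hadoop Streaming.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster the mappers and reducers run as separate tasks on different nodes, and the shuffle moves data over the network; the grouping logic is the same idea.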

Hadoop Ecosystem Components

Discover other important components of the Hadoop ecosystem, such as Hive, Pig, HBase, and ZooKeeper. Learn how these tools integrate with Hadoop to provide additional functionality for data processing and analysis.

Cluster Management and Monitoring

Learn how to manage and monitor a Hadoop cluster. Explore tools and techniques for cluster health monitoring, performance tuning, and troubleshooting common issues.

Data Security and Access Control

Understand the best practices for securing Hadoop clusters. Learn about data encryption, user authentication, and access control mechanisms to protect sensitive data and ensure compliance.

Backup and Recovery

Discover strategies for backing up and recovering data in a Hadoop environment. Learn about backup tools, data replication, and disaster recovery planning to ensure data integrity and availability.

Performance Optimization

Learn techniques for optimizing Hadoop performance. Understand how to configure Hadoop settings, tune job performance, and leverage hardware resources effectively to improve overall system efficiency.

Hands-On Labs and Projects

Engage in hands-on labs and projects to apply your Hadoop administration skills. Work on real-world scenarios to develop practical experience in managing and optimizing Hadoop clusters.

Hadoop Administration Syllabus

1. Introduction to Hadoop Administration

  • High Availability
  • Scaling
  • Advantages and Challenges

2. Introduction to Big Data

  • What is Big Data
  • Big Data Opportunities and Challenges
  • Characteristics of Big Data

3. Introduction to Hadoop

  • Hadoop Distributed File System
  • Comparing Hadoop & SQL
  • Industries Using Hadoop
  • Data Locality
  • Hadoop Architecture
  • MapReduce & HDFS
  • Using the Hadoop Single Node Image (Clone)

4. Hadoop Distributed File System (HDFS)

  • HDFS Design & Concepts
  • Blocks, NameNodes, and DataNodes
  • HDFS High-Availability and HDFS Federation
  • HDFS Command-Line Interface
  • Basic File System Operations
  • Anatomy of File Read and File Write
  • Block Placement Policy and Modes
  • Configuration Files
  • Metadata, FsImage, Edit Log, Secondary NameNode, and Safe Mode
  • Adding and Decommissioning DataNodes Dynamically
  • FSCK Utility (Block Report)
  • Overriding Default Configuration
  • ZooKeeper Leader Election Algorithm
  • Exercise and Small Use Case on HDFS

5. MapReduce

  • MapReduce Functional Programming Basics
  • Map and Reduce Basics
  • How MapReduce Works
  • Anatomy of a MapReduce Job Run
  • Legacy Architecture (Job Submission, Initialization, Task Assignment, Execution, Progress and Status Updates)
  • Job Completion and Failures
  • Shuffling and Sorting
  • Splits, Record Reader, Partition, Types of Partitions & Combiner
  • Optimization Techniques (Speculative Execution, JVM Reuse)
  • Types of Schedulers and Counters
  • Comparisons Between Old and New API
  • Getting Data from RDBMS into HDFS Using Custom Data Types
  • Distributed Cache and Streaming (Python, Ruby, R)
  • YARN
  • Sequence Files and Map Files
  • Enabling Compression Codecs
  • Map-Side Join with Distributed Cache
  • Types of I/O Formats (Multiple Outputs, NLineInputFormat)
  • Handling Small Files Using CombineFileInputFormat

6. MapReduce Programming – Java Programming

  • Hands-on "Word Count" in MapReduce (Standalone and Pseudo-Distribution Mode)
  • Sorting Files Using Hadoop Configuration API
  • Emulating "grep" for Searching Inside a File
  • DBInputFormat
  • Job Dependency API
  • Input Format API, Split API
  • Custom Data Type Creation in Hadoop

7. NoSQL

  • ACID in RDBMS and BASE in NoSQL
  • CAP Theorem and Types of Consistency
  • Types of NoSQL Databases in Detail
  • Columnar Databases (HBase and Cassandra)
  • TTL, Bloom Filters, and Compaction

8. HBase

  • HBase Installation and Concepts
  • HBase Data Model and Comparison with RDBMS and NoSQL
  • Master & Region Servers
  • HBase Operations (DDL and DML) through Shell and Programming
  • Catalog Tables
  • Block Cache and Sharding
  • Region Splits
  • Data Modeling (Sequential, Salted, Promoted, Random Keys)
  • Java APIs and REST Interface
  • Client-Side Buffering and Processing 1 Million Records
  • HBase Counters
  • Enabling Replication and HBase RAW Scans
  • HBase Filters
  • Bulk Loading and Co-Processors (Endpoints and Observers)
  • Real-World Use Case (HDFS, MapReduce, HBase)

9. Hive

  • Hive Installation, Introduction, and Architecture
  • Hive Services, Hive Shell, Hive Server, and Hive Web Interface (HWI)
  • Metastore, HiveQL
  • OLTP vs. OLAP
  • Working with Tables
  • Primitive and Complex Data Types
  • Working with Partitions
  • User Defined Functions
  • Hive Bucketed Tables and Sampling
  • External Partitioned Tables, Mapping Data to Partitions
  • Dynamic Partition
  • ORDER BY, DISTRIBUTE BY, and SORT BY Differences
  • Bucketing and Sorted Bucketing with Dynamic Partition
  • RCFile
  • Indexes and Views
  • Map-Side Joins
  • Compression on Hive Tables and Migrating Hive Tables
  • Dynamic Substitution of Hive and Running Hive
  • Enabling Updates in Hive
  • Log Analysis on Hive
  • Access HBase Tables Using Hive
  • Hands-on Exercises

10. Pig

  • Pig Installation
  • Execution Types
  • Grunt Shell
  • Pig Latin
  • Data Processing
  • Schema on Read
  • Primitive and Complex Data Types
  • Tuple Schema, BAG Schema, MAP Schema
  • Loading and Storing
  • Filtering, Grouping, and Joining
  • Debugging Commands
  • Validations and Type Casting
  • Working with Functions
  • User Defined Functions
  • Types of Joins in Pig and Replicated Join
  • SPLITS and Multi-Query Execution
  • Error Handling, FLATTEN and ORDER BY
  • Parameter Substitution
  • Nested For Each
  • User Defined Functions, Dynamic Invokers, and Macros
  • Accessing HBase Using Pig, Load and Write JSON Data
  • Piggy Bank
  • Hands-on Exercises

11. Sqoop

  • Sqoop Installation
  • Import Data (Full Table, Subset, Target Directory, etc.)
  • Incremental Import (New Data, Last Imported Data)
  • Free Form Query Import
  • Export Data to RDBMS, Hive, and HBase
  • Hands-on Exercises

12. HCatalog

  • HCatalog Installation
  • Introduction to HCatalog
  • HCatalog with Pig, Hive, and MapReduce
  • Hands-on Exercises

13. Flume

  • Flume Installation
  • Introduction to Flume
  • Flume Agents: Sources, Channels, and Sinks
  • Log User Information Using Java Program into HDFS
  • Log User Information Using Java Program into HBase
  • Flume Commands
  • Use Case: Flume Data from Twitter to HDFS and HBase

14. More Ecosystems

  • HUE (Hortonworks and Cloudera)

15. Oozie

  • Workflow (Action, Start, End, Kill, Join, Fork)
  • Schedulers, Coordinators, Bundles
  • Scheduling Sqoop Jobs, Hive, MapReduce, Pig
  • Real-World Use Case: Top Websites by User Age

16. ZooKeeper

  • HBase Integration with Hive and Pig
  • Phoenix
  • Proof of Concept (POC)

17. Spark

  • Spark Overview
  • Linking with Spark, Initializing Spark
  • Using the Shell
  • Resilient Distributed Datasets (RDDs)
  • Parallelized Collections
  • External Datasets
  • RDD Operations
  • Basics, Passing Functions to Spark
  • Working with Key-Value Pairs
  • Transformations
  • Actions
  • RDD Persistence
  • Choosing Storage Level
  • Removing Data
  • Shared Variables
  • Broadcast Variables
  • Accumulators
  • Deploying to a Cluster
  • Unit Testing
  • Migrating from Pre-1.0 Versions of Spark
  • Where to Go from Here

Training

Basic Level Training

Duration: 1 Month

Advanced Level Training

Duration: 1 Month

Project Level Training

Duration: 1 Month

Total Training Period

Duration: 3 Months

Course Mode:

Available Online / Offline

Course Fees:

Please contact the office for details

Placement Benefit Services

Provide 100% job-oriented training
Develop multiple skill sets
Assist in project completion
Build ATS-friendly resumes
Add relevant experience to profiles
Build and enhance online profiles
Supply manpower to consultants
Supply manpower to companies
Prepare candidates for interviews
Add candidates to job groups
Send candidates to interviews
Provide job references
Assign candidates to contract jobs
Select candidates for internal projects

Note

100% Job Assurance Only
Daily online batches for employees
New course batches start every Monday