Hadoop YARN Architecture

Anshuman Singh


As data science and big data applications grew in complexity and scale, efficient resource management became a critical need within the Hadoop ecosystem. Traditional MapReduce had limitations in handling diverse workloads and dynamic resource allocation, prompting the development of a more flexible solution—YARN (Yet Another Resource Negotiator). Introduced in Hadoop 2.0, YARN acts as the operating system of Hadoop, managing resources and scheduling jobs across a cluster. It decouples resource management from data processing, enabling multi-tenancy, scalability, and support for various processing engines like MapReduce, Spark, and Tez. YARN’s architecture transformed Hadoop into a general-purpose data platform capable of running multiple applications concurrently.

What is YARN Architecture?

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop 2.0 and beyond, designed to address the limitations of the original MapReduce framework. Unlike its predecessor, which tightly coupled resource management with data processing, YARN introduces a modular architecture that separates these concerns, providing greater flexibility and scalability.

At a high level, YARN sits between the Hadoop Distributed File System (HDFS) and the various processing engines, acting as a centralized resource manager. It dynamically allocates computing resources to different applications running on a Hadoop cluster, ensuring optimal utilization.

YARN allows Hadoop to support multiple types of data-processing frameworks—not just MapReduce but also Apache Spark, Tez, and others—making it a multi-tenant, general-purpose platform. This architectural shift enables Hadoop to run diverse workloads concurrently, improves resource efficiency, and provides the foundation for advanced scheduling, fault tolerance, and performance monitoring.

Main Components of Hadoop YARN Architecture

YARN’s architecture is composed of four core components that work together to manage resources and execute applications efficiently across a Hadoop cluster.

1. Resource Manager

The Resource Manager (RM) is the central authority responsible for resource allocation across the entire cluster. It operates on the master node and governs how applications are assigned computing resources. The RM is divided into two sub-components:

  • Scheduler: Allocates resources to running applications according to the configured policy, such as the FIFO, Capacity, or Fair Scheduler. It is a pure scheduler: it does not monitor application status or restart failed tasks.
  • Applications Manager: Accepts application submissions, negotiates the first container in which each Application Master runs, and restarts the Application Master container if it fails.

Together, they ensure optimal utilization of cluster resources, maintain system balance, and provide scalability across large deployments.
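
The scheduling policy is chosen through Hadoop configuration, normally in yarn-site.xml. The following minimal Java sketch (the class name is illustrative, and it assumes the standard hadoop-yarn client libraries are on the classpath) sets and reads back the relevant property to show which setting selects the scheduler:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SchedulerConfigExample {
        public static void main(String[] args) {
            // On a real cluster this property is set in yarn-site.xml; it is set
            // programmatically here only to show which knob selects the scheduler.
            YarnConfiguration conf = new YarnConfiguration();
            conf.set("yarn.resourcemanager.scheduler.class",
                    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");

            // The Fair and FIFO schedulers are selected the same way, by naming
            // their implementation classes instead.
            System.out.println("Configured scheduler: "
                    + conf.get("yarn.resourcemanager.scheduler.class"));
        }
    }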

2. Node Manager

The Node Manager (NM) runs on every data node in the cluster. It is responsible for monitoring resource usage (CPU, memory, disk) on its local node and reporting this information to the Resource Manager. It also launches and manages containers as instructed by the Application Master. The Node Manager ensures that applications do not exceed allocated resources, maintaining system stability.
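
The resource reports that Node Managers send to the Resource Manager are visible to any client. The sketch below (an illustrative class name, assuming the standard YarnClient API from hadoop-yarn-client and a reachable cluster) prints what each running Node Manager has advertised and how much of it is currently in use:

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListNodeManagers {
        public static void main(String[] args) throws Exception {
            // Connects to the Resource Manager named in the cluster configuration.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Each NodeReport summarizes one Node Manager: its total capability
            // (memory and vcores) and the resources currently in use.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + " capability=" + node.getCapability()
                        + " used=" + node.getUsed());
            }

            yarnClient.stop();
        }
    }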

3. Application Master

Every application submitted to YARN has its own Application Master (AM). The AM itself runs inside a container and is responsible for the entire lifecycle of the application: requesting resources from the Resource Manager, working with Node Managers to execute tasks, monitoring progress, and recovering from failures. The AM acts as the orchestrator for a single job, ensuring all required tasks run and complete successfully.
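
To make the AM's role concrete, the sketch below shows the core handshake an Application Master performs with the Resource Manager using the AMRMClient API: register, request a container, heartbeat via allocate(), and unregister. It is a simplified illustration; it assumes it is launched by YARN inside the application's first container, where the required credentials and environment are provided, and a real AM would keep heartbeating and launch work in the containers it receives.

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MinimalApplicationMaster {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // Register this Application Master with the Resource Manager.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL

            // Ask for one 1 GB / 1 vcore container anywhere in the cluster.
            Resource capability = Resource.newInstance(1024, 1);
            Priority priority = Priority.newInstance(0);
            rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

            // allocate() is the AM's heartbeat: it carries requests to the
            // Resource Manager and brings newly granted containers back.
            int granted = rmClient.allocate(0.1f).getAllocatedContainers().size();
            System.out.println("Containers granted so far: " + granted);

            // Tell the Resource Manager the application has finished.
            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
            rmClient.stop();
        }
    }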

4. Containers

Containers are YARN's basic unit of resource allocation and execution. Each container is a bundle of resources (memory, CPU) on a specific node, inside which the Node Manager runs an individual task such as a Map or Reduce operation. Containers enable task isolation and make YARN highly scalable and resource-efficient, allowing multiple jobs to run concurrently without conflict.
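
A container is described to YARN by two pieces: a resource specification and a launch context. The sketch below builds both with hypothetical values (2 GB, one vcore, a trivial command, no localized files) to show what a Node Manager receives before it starts a task:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;

    public class ContainerSpecExample {
        public static void main(String[] args) {
            // The resource envelope one container is bounded by: 2 GB of memory, 1 vcore.
            Resource resource = Resource.newInstance(2048, 1);

            // The command the Node Manager will run inside the container, plus its
            // localized files and environment (both empty in this sketch).
            ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),          // localized files (jars, configs)
                    Collections.emptyMap(),          // environment variables
                    Collections.singletonList("echo hello from a YARN container"),
                    null, null, null);               // service data, tokens, ACLs

            System.out.println("Requested resource: " + resource);
            System.out.println("Launch commands: " + ctx.getCommands());
        }
    }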

Application Workflow in Hadoop YARN

The execution workflow in YARN follows a coordinated, multi-step process that begins with application submission and ends with result aggregation:

  1. Job Submission: A client submits an application (e.g., a MapReduce or Spark job) to the Resource Manager (RM), which registers the application and assigns an Application ID.
  2. Application Master Launch: The Resource Manager selects a Node Manager (NM) and instructs it to launch the Application Master (AM) in a container. The AM takes charge of managing the job’s execution lifecycle.
  3. Resource Request & Allocation: The Application Master communicates with the Resource Manager to request containers for individual tasks. Based on available resources and scheduling policies, the RM allocates containers on various nodes.
  4. Task Execution: The AM sends task launch commands to the relevant Node Managers, which start the tasks within allocated containers. These containers execute application code—Map, Reduce, or other processing logic.
  5. Monitoring & Fault Tolerance: The AM monitors task progress. If a task fails, it can request new containers to restart execution.
  6. Completion & Result Aggregation: Once all tasks are complete, the AM notifies the Resource Manager. The client retrieves the results from HDFS or specified output locations.

This distributed coordination enables YARN to manage multiple applications efficiently and with fault tolerance.
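
The sketch below walks the same workflow from the client's side, assuming a reachable cluster and the standard hadoop-yarn-client API. The class name and application name are illustrative, and the AM container simply runs a shell command to keep the example short, so the Resource Manager will eventually mark the attempt as failed because nothing registers as an Application Master.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.*;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitAndWatch {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // Step 1: the client registers a new application with the Resource Manager.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();
            YarnClientApplication app = yarnClient.createApplication();

            // Step 2: describe the container that will host the Application Master.
            // A real AM would be a Java process that registers with the Resource Manager.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(), Collections.emptyMap(),
                    Collections.singletonList("sleep 30"), null, null, null);

            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("yarn-workflow-demo");
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(1024, 1));
            appContext.setQueue("default");

            // Steps 3-6 happen inside the cluster; the client only polls the report.
            ApplicationId appId = yarnClient.submitApplication(appContext);
            YarnApplicationState state;
            do {
                Thread.sleep(1000);
                state = yarnClient.getApplicationReport(appId).getYarnApplicationState();
                System.out.println(appId + " is " + state);
            } while (state != YarnApplicationState.FINISHED
                    && state != YarnApplicationState.FAILED
                    && state != YarnApplicationState.KILLED);

            yarnClient.stop();
        }
    }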

Features of YARN Architecture

YARN introduces several powerful features that enhance the performance and flexibility of the Hadoop ecosystem:

  • Scalability: YARN efficiently manages thousands of nodes and jobs, making it suitable for clusters of all sizes—from small deployments to massive enterprise-level systems.
  • Flexibility: Unlike the original MapReduce-only architecture, YARN supports multiple data processing frameworks, including Apache Spark, Apache Tez, Hive on Tez, and more. This allows developers to run diverse workloads—batch, interactive, streaming—on a unified platform.
  • High Availability: YARN ensures cluster reliability through built-in failover mechanisms and automatic task recovery, minimizing downtime during node or task failures.
  • Efficient Resource Utilization: Through its dynamic container allocation system, YARN maximizes resource usage by distributing CPU and memory where it’s needed most, based on real-time demand.
  • Centralized Monitoring: YARN offers tools for real-time job tracking and resource monitoring, giving administrators visibility into cluster activity and health.
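
As a small illustration of the monitoring point above, the following sketch (an illustrative class name, assuming the standard YarnClient API and a reachable Resource Manager) lists the number of active Node Managers and the state and progress of every application the Resource Manager is tracking:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterOverview {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // How many Node Managers the Resource Manager currently sees.
            System.out.println("Active Node Managers: "
                    + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

            // One line per application known to the Resource Manager.
            for (ApplicationReport report : yarnClient.getApplications()) {
                System.out.println(report.getApplicationId()
                        + " " + report.getName()
                        + " " + report.getYarnApplicationState()
                        + " progress=" + report.getProgress());
            }

            yarnClient.stop();
        }
    }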

Advantages and Disadvantages of YARN

Advantages

  • Better Resource Management: YARN dynamically allocates resources based on application needs, improving overall cluster performance.
  • Application Isolation: Containers ensure task-level isolation, reducing conflicts and improving security.
  • Improved Cluster Utilization: Multiple data processing engines can run simultaneously, maximizing hardware usage.

Disadvantages

  • Complex Setup and Maintenance: Configuring and tuning YARN for optimal performance can be technically demanding, especially in large clusters.
  • Application Master Overhead: Each application requires a separate Application Master, which can introduce resource overhead and complicate monitoring for high-volume workloads.
