Resource Management and Job Scheduling in YARN

Apache Hadoop is a big data processing framework that allows for distributed storage and processing of large datasets across clusters of computers. One of the key components of Hadoop is YARN (Yet Another Resource Negotiator), which is responsible for resource management and job scheduling.

Resource Management

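A Hadoop cluster is made up of multiple nodes whose combined hardware forms the cluster's resources: CPU, memory, disk space, and network bandwidth. In practice, YARN's schedulers allocate primarily memory and CPU (expressed as virtual cores), and they hand these out to applications in units called containers. The goal is to keep the cluster busy while making sure that every submitted job eventually gets the resources it needs to run.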

YARN splits the work of resource management between two kinds of daemons: a single ResourceManager and one NodeManager per node. The ResourceManager is the central authority that accepts job submissions and arbitrates resources among competing applications. It keeps track of the resources available across the cluster and assigns them to applications based on their requests.

NodeManagers run on the individual nodes and manage the resources of the node they run on. Each NodeManager launches and monitors the containers assigned to its node and periodically sends a heartbeat to the ResourceManager that reports the node's available resources and current utilization. These reports give the ResourceManager an up-to-date picture of where capacity is free.
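The interaction described above can be sketched as a toy model. This is only an illustration, not YARN's actual implementation or API: the names NodeReport, ToyResourceManager, heartbeat, and allocate are hypothetical, and the first-fit placement rule is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """What a (hypothetical) NodeManager heartbeat carries in this toy model."""
    node_id: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class ToyResourceManager:
    """Keeps the latest report per node and places containers first-fit."""
    nodes: dict = field(default_factory=dict)

    def heartbeat(self, report: NodeReport) -> None:
        # A newer report from the same node replaces the older one.
        self.nodes[report.node_id] = report

    def allocate(self, memory_mb: int, vcores: int):
        """Return the id of the first node that fits the request, or None."""
        for node in self.nodes.values():
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                # Reserve the resources so later requests see less free capacity.
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return node.node_id
        return None


rm = ToyResourceManager()
rm.heartbeat(NodeReport("node-1", free_memory_mb=4096, free_vcores=2))
rm.heartbeat(NodeReport("node-2", free_memory_mb=8192, free_vcores=4))

first = rm.allocate(memory_mb=3072, vcores=2)   # fits on node-1
second = rm.allocate(memory_mb=3072, vcores=2)  # node-1 has no vcores left
third = rm.allocate(memory_mb=16384, vcores=1)  # no node is large enough
print(first, second, third)  # node-1 node-2 None
```

The point of the sketch is the division of labor: nodes only report what they have, and a single central component decides where each request lands.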

Job Scheduling

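Job scheduling is the part of resource management that decides which application receives resources next. The scheduler in YARN is pluggable and is selected with the yarn.resourcemanager.scheduler.class property of the ResourceManager. The two schedulers most commonly used on shared clusters are the Capacity Scheduler and the Fair Scheduler.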

Capacity Scheduler

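The Capacity Scheduler enables multiple organizations or users to share a Hadoop cluster while giving each of them a guaranteed minimum. It divides the cluster resources into separate queues, each configured with its own capacity, typically as a percentage of the cluster. Organizations or users submit jobs to these queues, and the scheduler ensures that each queue can obtain its configured share of resources. Capacity that a queue is not using can be lent to busier queues, up to their configured maximum.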
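As a rough illustration of how queue capacities translate into guaranteed resources, the sketch below converts percentage capacities into memory amounts. The function name guaranteed_shares and the queue names are hypothetical, and the calculation ignores queue hierarchies, maximum capacities, and borrowing of idle capacity.

```python
def guaranteed_shares(total_memory_mb, queue_capacity_percent):
    """Convert per-queue capacity percentages into guaranteed memory (MB)."""
    total_percent = sum(queue_capacity_percent.values())
    # Sibling queue capacities are expected to add up to 100 percent.
    if abs(total_percent - 100.0) > 1e-9:
        raise ValueError(f"queue capacities sum to {total_percent}, expected 100")
    return {
        queue: total_memory_mb * percent / 100.0
        for queue, percent in queue_capacity_percent.items()
    }


shares = guaranteed_shares(
    100_000,
    {"engineering": 50, "analytics": 30, "adhoc": 20},
)
print(shares)
# {'engineering': 50000.0, 'analytics': 30000.0, 'adhoc': 20000.0}
```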

Within a queue, the Capacity Scheduler supports two ordering policies: FIFO (First-In-First-Out), which is the default, and Fair. With the FIFO policy, the scheduler orders applications by their submission time, so earlier applications are served first. With the Fair policy, the scheduler tries to balance resources among all active applications in the queue by favoring those that currently hold the least.
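The difference between the two orderings can be sketched with a small comparison. This is a simplified model under stated assumptions: the App class and both functions are hypothetical, memory usage stands in for "resources held", and details such as application priorities are left out.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    submit_time: int      # smaller means submitted earlier
    used_memory_mb: int   # resources the application currently holds


def fifo_order(apps):
    """Serve applications strictly in order of submission."""
    return sorted(apps, key=lambda a: a.submit_time)


def fair_order(apps):
    """Serve the application holding the least first; ties go to the older one."""
    return sorted(apps, key=lambda a: (a.used_memory_mb, a.submit_time))


apps = [
    App("etl", submit_time=1, used_memory_mb=6000),
    App("report", submit_time=2, used_memory_mb=1000),
    App("adhoc", submit_time=3, used_memory_mb=0),
]

fifo_names = [a.name for a in fifo_order(apps)]
fair_names = [a.name for a in fair_order(apps)]
print(fifo_names)  # ['etl', 'report', 'adhoc']
print(fair_names)  # ['adhoc', 'report', 'etl']
```

Under FIFO the long-running etl job keeps its place at the front, while under the fair ordering the newly submitted adhoc job is offered resources first because it holds none yet.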

Fair Scheduler

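The Fair Scheduler also supports sharing the cluster resources among multiple users or organizations. It aims to give every running application a roughly equal share of resources over time, regardless of submission time, so a short job submitted after a long one does not have to wait for the long job to finish.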

The Fair Scheduler achieves this by dividing the cluster's resources into fair shares. Each application is entitled to a fair share, and the scheduler continuously recomputes the shares as demand changes. If an application is idle or needs less than its share, the unused portion is made available to other applications, which keeps the cluster well utilized.
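One common way to formalize "fair shares that adapt to demand" is max-min fairness, sketched below. This is an assumption made for illustration rather than a description of the Fair Scheduler's internals: the function fair_shares is hypothetical and considers a single resource with no weights, minimum shares, or queues.

```python
def fair_shares(capacity, demands):
    """Max-min fair split of `capacity` among apps with the given demands."""
    shares = {app: 0.0 for app in demands}
    unsatisfied = {app for app, demand in demands.items() if demand > 0}
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        equal_slice = remaining / len(unsatisfied)
        # Apps that need no more than an equal slice get their full demand.
        capped = {a for a in unsatisfied if demands[a] - shares[a] <= equal_slice}
        if not capped:
            # Everyone wants more than an equal slice: split what is left evenly.
            for app in unsatisfied:
                shares[app] += equal_slice
            remaining = 0.0
            break
        for app in capped:
            remaining -= demands[app] - shares[app]
            shares[app] = float(demands[app])
        # Capacity the capped apps did not need is redistributed next round.
        unsatisfied -= capped

    return shares


result = fair_shares(100, {"a": 10, "b": 50, "c": 80, "idle": 0})
print(result)  # {'a': 10.0, 'b': 45.0, 'c': 45.0, 'idle': 0.0}
```

In this example the idle application receives nothing, application a receives only the 10 units it asked for, and the capacity they leave unused is split evenly between b and c.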

Conclusion

Resource management and job scheduling in YARN play a crucial role in the efficient utilization of a Hadoop cluster. The ResourceManager and NodeManagers work together to manage the available resources, while the Capacity Scheduler and Fair Scheduler ensure fair allocation of resources to different applications. With these mechanisms in place, YARN enables efficient execution of big data processing tasks and supports the scalability and reliability of Apache Hadoop.

