MapReduce Job Execution
Once the resource manager’s scheduler has assigned a task resources for a container on a particular node, the application master starts the container by contacting the node manager. The task is executed by a Java application whose main class is YarnChild.
Before it can run the task, YarnChild localizes the resources that the task needs, including the job configuration, the JAR file, and any files from the distributed cache. Finally, it runs the map or reduce task. YarnChild runs in a dedicated JVM, so bugs in the user-defined map and reduce functions (or even in YarnChild itself) cannot affect the node manager, for example by causing it to crash or hang.
Each task can perform setup and commit actions, which run in the same JVM as the task itself and are determined by the OutputCommitter for the job. For file-based jobs, the commit action moves the task output from a temporary location to its final location. When speculative execution is enabled, the commit protocol ensures that only one of the duplicate tasks is committed and the other is aborted.
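The commit step can be pictured with a small sketch. This is not Hadoop's OutputCommitter; it is a simplified, single-process illustration in Python, and the function names (run_attempt, commit_or_abort) and file names are hypothetical. It only shows the idea that each attempt writes to a temporary location and that exactly one duplicate is moved to the final location.

```python
import os
import tempfile


def run_attempt(work_dir, attempt_id, records):
    """Write one attempt's output to its own temporary file and return the path."""
    tmp_path = os.path.join(work_dir, f"_temporary_{attempt_id}")
    with open(tmp_path, "w") as f:
        f.writelines(f"{r}\n" for r in records)
    return tmp_path


def commit_or_abort(tmp_path, final_path):
    """Commit the attempt if nothing has been committed yet; otherwise abort it.

    Assumption: a plain existence check stands in for the real commit protocol,
    which is not shown here.
    """
    if os.path.exists(final_path):
        os.remove(tmp_path)           # abort: discard the duplicate's output
        return False
    os.rename(tmp_path, final_path)   # commit: move output to its final location
    return True


# Two speculative attempts of the same task produce the same output.
work_dir = tempfile.mkdtemp()
final_path = os.path.join(work_dir, "part-00000")
first = run_attempt(work_dir, "attempt_0", ["a\t1", "b\t2"])
second = run_attempt(work_dir, "attempt_1", ["a\t1", "b\t2"])
results = [commit_or_abort(first, final_path), commit_or_abort(second, final_path)]
```

Here the first attempt to finish is committed and the second is aborted, so the final location holds a single copy of the output.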
What does Streaming mean?
Streaming runs special map and reduce tasks whose purpose is to launch the user-supplied executable and communicate with it. The Streaming task communicates with the process using standard input and output streams. During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
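As an illustration of the external process's side of this exchange, here is a minimal word-count mapper sketch in Python. The function names (map_line, run_mapper) and the sample input are hypothetical; the sketch assumes the default Streaming convention of one record per line with key and value separated by a tab. In-memory streams stand in for the pipes that the Streaming task would connect.

```python
import io
import sys


def map_line(line):
    """Hypothetical user-defined map function: emit (word, 1) for each word."""
    return [(word, 1) for word in line.split()]


def run_mapper(stdin, stdout):
    """Read input records from stdin and write tab-separated key-value pairs to stdout."""
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")


# Simulate the streams that the Java side would attach to the process.
fake_stdin = io.StringIO("to be or\nnot to be\n")
fake_stdout = io.StringIO()
run_mapper(fake_stdin, fake_stdout)

# In a real Streaming script the call would instead be:
#     run_mapper(sys.stdin, sys.stdout)
```

Because the process only reads standard input and writes standard output, the same pattern works for an executable written in any language.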
From the node manager’s point of view, it is as if the child process ran the map or reduce code itself.
MapReduce jobs are long-running batch jobs that can take anything from tens of seconds to hours to run. Because this can be a significant length of time, it is important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status, which includes the state of the job or task, the progress of maps and reduces, the values of the job’s counters, and a status message or description. These statuses change over the course of the job.
When a task is running, it keeps track of its progress (i.e., the proportion of the task completed). For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce input processed.
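A rough sketch of how such progress values could be computed is shown below. The function names are hypothetical. The reduce estimate assumes the commonly described model in which a reduce task has three equally weighted phases (copy, sort, reduce); that weighting is an assumption added here for illustration and is not stated in the text above.

```python
def map_progress(bytes_processed, split_length):
    """Proportion of the map input that has been processed, between 0.0 and 1.0."""
    if split_length <= 0:
        return 1.0  # assumption: an empty input split counts as fully processed
    return min(1.0, bytes_processed / split_length)


def reduce_progress(copy_fraction, sort_fraction, reduce_fraction):
    """Estimate reduce progress as the average of three equally weighted phases.

    Each argument is the completed fraction (0.0 to 1.0) of one phase.
    """
    return (copy_fraction + sort_fraction + reduce_fraction) / 3.0


quarter_done = map_progress(50, 200)              # 50 of 200 bytes read
mid_reduce = reduce_progress(1.0, 1.0, 0.5)       # copy and sort done, reduce half done
```

Under this model, a reduce task that has finished copying and sorting and has run the reducer on half its input reports progress of 5/6.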
The following operations count as progress:
- Read an input record in a mapper or reducer.
- Write an output record in a mapper or reducer.
- Set the status description.
- Increment a counter using Reporter’s incrCounter() method or Counter’s increment() method.
- Call Reporter’s or TaskAttemptContext’s progress() method.
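A Streaming process cannot call the Java methods listed above directly. Hadoop Streaming instead accepts counter and status updates as specially formatted lines written to standard error. The sketch below builds those lines in Python; the helper names and the group and counter names ("WordCount", "lines") are hypothetical, and an in-memory stream stands in for standard error.

```python
import io
import sys


def counter_line(group, counter, amount):
    """Format a counter increment in the form Hadoop Streaming reads from stderr."""
    return f"reporter:counter:{group},{counter},{amount}"


def status_line(message):
    """Format a status description update for Hadoop Streaming."""
    return f"reporter:status:{message}"


def report(line, stream):
    """Write one reporter line and flush so the update is seen promptly."""
    stream.write(line + "\n")
    stream.flush()


# Demonstration with an in-memory stream; a real script would pass sys.stderr.
fake_stderr = io.StringIO()
report(counter_line("WordCount", "lines", 1), fake_stderr)
report(status_line("processing input"), fake_stderr)
```

Sending such updates periodically also signals that the task is still doing work.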