Monday, 17 October 2016

HDFS functionality:

        The name node maintains a live view of the cluster by listening to the data nodes. At a fixed interval, each data node sends its status to the name node; this mechanism is called a heartbeat. If the name node does not receive a heartbeat from a data node, it assumes that the data node is down and stops sending tasks to it.
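
       A toy sketch of the heartbeat idea is shown below. This is not Hadoop's actual implementation; the class name and the 10-minute timeout are illustrative assumptions.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of heartbeat tracking; not Hadoop's real code.
public class HeartbeatMonitor {
    // Assumed timeout: treat a data node as dead after 10 minutes of silence.
    private static final long TIMEOUT_MS = 10 * 60 * 1000;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a data node reports its status (a heartbeat).
    public void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // The name node checks this before assigning work to a data node.
    public boolean isAlive(String dataNodeId) {
        Long last = lastHeartbeat.get(dataNodeId);
        return last != null && (System.currentTimeMillis() - last) < TIMEOUT_MS;
    }
}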

       Hadoop has different services to execute jobs and track their completion. Some of the components are described below.

(Hadoop architecture diagram. Image courtesy: dezyre.com)

Job:

       A program submitted to Hadoop for execution is called a job.

Task:

      A job is sent to different slave nodes to process the data residing on their data nodes. These instances of a job are called tasks.

Job tracker:

      The job tracker runs on the master node. It is responsible for gathering the status of each task from the slave nodes and for tracking the completion of the job.

Task tracker:

      The task tracker runs on a slave node and is responsible for collecting the status of the tasks running on that slave node and sending these statuses to the job tracker.


       Initially, the client submits the program that processes the data to the name node. The job tracker running on the name node gets the metadata of the data that needs to be processed; this metadata includes the data nodes and blocks where the data is stored. The program is then sent to the slave nodes where the data resides. The instances of the program running on the different slave nodes are called tasks.

      The task tracker is responsible for maintaining the status of the tasks and sends these statuses to the job tracker. When all the tasks belonging to a job are completed, the job tracker marks the job as completed.
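
      A rough sketch of this bookkeeping is below. The class and method names are assumptions for illustration, not the actual JobTracker API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of how a job tracker could aggregate task statuses.
public class SimpleJobTracker {
    enum TaskState { RUNNING, COMPLETED, FAILED }

    private final Map<String, TaskState> tasks = new ConcurrentHashMap<>();

    // Task trackers report the state of each task belonging to the job.
    public void reportTaskStatus(String taskId, TaskState state) {
        tasks.put(taskId, state);
    }

    // The job is complete only when every task has completed.
    public boolean isJobComplete() {
        return !tasks.isEmpty()
                && tasks.values().stream().allMatch(s -> s == TaskState.COMPLETED);
    }
}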

HDFS Architecture:

           HDFS has a master-slave architecture. The master node is called the name node and the slave nodes are called data nodes. HDFS has one name node and multiple data nodes. The actual data is stored on the data nodes, while the name node keeps the information about the data stored on them.

           High-level details are covered in this section; a detailed explanation will be given in later posts.


(HDFS architecture diagram)


Master Node:

           HDFS has one name node, which keeps the information about the state of all data nodes. The name node is responsible for the tasks below (a toy sketch of this bookkeeping follows the list).
  1. Maintaining the status (running or down) of the data nodes.
  2. Storing the metadata of the files stored on the data nodes.
  3. Sending requests to data nodes for replication of files.
  4. Responding to client requests about the metadata of the files and job status.
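
           A minimal sketch of the kind of bookkeeping the name node keeps is below. The types here are illustrative assumptions, not Hadoop classes.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy view of name node metadata: which blocks make up a file,
// which data nodes hold each block, and which nodes are alive.
public class NameNodeMetadata {
    // file path -> ordered list of block IDs
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // block ID -> data nodes holding a replica of that block
    private final Map<String, List<String>> blockLocations = new HashMap<>();
    // data node ID -> whether the node is currently alive (from heartbeats)
    private final Map<String, Boolean> dataNodeAlive = new HashMap<>();

    public void recordFile(String path, List<String> blockIds) {
        fileToBlocks.put(path, blockIds);
    }

    public void recordBlockLocation(String blockId, List<String> dataNodes) {
        blockLocations.put(blockId, dataNodes);
    }

    // Answers a client's metadata query: which data nodes hold this file's blocks?
    public Map<String, List<String>> locateBlocks(String path) {
        Map<String, List<String>> result = new HashMap<>();
        for (String blockId : fileToBlocks.getOrDefault(path, List.of())) {
            result.put(blockId, blockLocations.getOrDefault(blockId, List.of()));
        }
        return result;
    }
}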
Data node:

           Data is stored on the disks of the machine where the data node is running. The data node is responsible for writing data to and reading data from these disks.

Client: 

          The client is connected to both the name node and the data nodes. The client posts programs to the name node, which in turn posts them to the data nodes.

      To read or write a file, the client first sends a request to the name node. The name node checks its metadata (which data nodes hold the file's blocks) and returns these details to the client. The client then writes or reads the data directly from the data nodes.
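
      For illustration, reading a file with Hadoop's Java API looks roughly like this. The name node lookup and the direct streaming from data nodes happen behind fs.open(); the cluster address and file path here are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt"); // hypothetical file

        // fs.open() asks the name node for block locations, then the client
        // reads the blocks directly from the data nodes that hold them.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}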

File parts:

     Files are divided into parts and stored on different disks. The name node keeps the information about how many parts a file is divided into and where they are stored.

Blocks:

         Like any other file system, HDFS stores data in blocks, but the block size of HDFS is much larger than that of other file systems. The default block size of Apache Hadoop is 64 MB, while Cloudera's Hadoop distribution defaults to 128 MB.

          The name node stores metadata for every block, so if the block size were small, a large file would be split into many blocks and the name node would have to hold a huge amount of metadata. Note that, unlike some file systems, HDFS does not reserve a full block on disk when the stored data is smaller than the block size; only the actual data size is used.
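
          A quick back-of-the-envelope sketch of why block size matters for metadata; the file size and the 4 MB comparison block are assumptions.

// Rough arithmetic: more blocks per file means more metadata entries on the name node.
public class BlockCountExample {
    public static void main(String[] args) {
        long fileSize   = 1024L * 1024 * 1024 * 1024;  // 1 TB file (assumed)
        long smallBlock = 4L * 1024 * 1024;            // 4 MB, typical of ordinary file systems
        long hdfsBlock  = 128L * 1024 * 1024;          // 128 MB HDFS block

        System.out.println("4 MB blocks:   " + ceilDiv(fileSize, smallBlock) + " blocks to track");
        System.out.println("128 MB blocks: " + ceilDiv(fileSize, hdfsBlock) + " blocks to track");
    }

    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }
}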


Saturday, 15 October 2016

Introduction to Hadoop:

Hadoop is a framework developed in Java to store files in the HDFS file system and access files stored in HDFS.

HDFS: 

          HDFS stands for Hadoop Distributed File System. It is a file system which, like NTFS (used by Windows) or ext3/ext4 (used by Linux), deals with how data is stored on and accessed from disks.

Hadoop: 

          Hadoop is a framework which provides users with Java packages and simple commands to interact with HDFS.
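
          As a small example of the Java side, listing a directory in HDFS could look like the sketch below; the cluster address and path are assumptions, and the equivalent shell command would be hadoop fs -ls /user/demo.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address

        FileSystem fs = FileSystem.get(conf);
        // List the contents of a directory, similar to `hadoop fs -ls /user/demo`.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}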

Why Hadoop and HDFS: 

        HDFS is used for distributed computing, with data stored at a high replication factor on commodity (cheap) hardware. HDFS was introduced to deal with the huge volumes of data which are the norm for big organizations these days. The speed of accessing data is what makes HDFS a good solution for big data.

Why Hadoop is effective: 
         
           In disk terminology, seek time is the time taken by the read/write head of a hard disk to move from one location on the disk to another. Seek time affects how fast data can be accessed from the disk. As mentioned in "Hadoop: The Definitive Guide", the storage capacity of hard disks has increased by about 1,000 times over the last 20 years, while data access speed has increased only by about 20 times; it takes more than two hours to read one terabyte of data from a single hard disk. This is a bottleneck when processing huge volumes of data.

          To address this issue, Hadoop splits each file into parts and stores them across a number of disks. Suppose the default part size is set to 100 MB: a file containing 10 GB of data will be split into 100 parts and stored on 100 different disks. When an application reads this file, the parts can be read in parallel, so the read can be up to 100 times faster than reading the whole file from a single disk.
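
          A rough worked example of that speedup, using the numbers from the paragraph above; the 100 MB/s disk transfer rate is an assumption.

// Back-of-the-envelope: reading 10 GB serially vs. from 100 disks in parallel.
public class ParallelReadExample {
    public static void main(String[] args) {
        double fileMb       = 10 * 1024;  // 10 GB file
        double diskMbPerSec = 100;        // assumed sustained read speed of one disk
        int parts           = 100;        // file split into 100 parts of ~100 MB each

        double serialSeconds   = fileMb / diskMbPerSec;
        double parallelSeconds = (fileMb / parts) / diskMbPerSec;

        System.out.printf("Serial read:   %.0f seconds%n", serialSeconds);               // ~102 s
        System.out.printf("Parallel read: %.0f seconds across %d disks%n", parallelSeconds, parts); // ~1 s
    }
}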

Issues with distributing the files:

           When we split a file into n parts and store them on n disks, the risk that a disk failure causes loss of data increases roughly n times. To prevent this, replication is introduced.

Replication:

           Each part of a file in Hadoop is stored on multiple disks to avoid loss of data in case of disk failures. This is called replication. The number of copies stored of each part is called the replication factor (the default in HDFS is 3).
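
           The replication factor can also be set per file through Hadoop's Java API; a small sketch is below (the file path is an assumption, and the cluster-wide default is controlled by dfs.replication in hdfs-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/sample.txt"); // hypothetical file

        // Ask HDFS to keep 3 copies of every block of this file.
        fs.setReplication(path, (short) 3);
    }
}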