Introduction to Hadoop:
Hadoop is a framework written in Java for storing files in the HDFS file system and for accessing the files stored there.
HDFS:
HDFS stands for Hadoop Distributed File System. Like NTFS (used by Windows) and ext4 (used by many Linux systems), it is a file system: it governs how data is stored on disks and how it is read back. Unlike those, HDFS spreads the data across the disks of many machines.
Hadoop:
Hadoop is a framework that gives users Java packages, and simple commands, for interacting with HDFS.
Why Hadoop and HDFS:
HDFS is used for distributed computing, with data stored at a high replication factor on commodity (cheap) hardware. It was introduced to handle the huge volumes of data that are now the norm for big organizations. The speed at which data can be accessed is what makes HDFS a good fit for big data.
Why Hadoop is effective:
In disk terminology, seek time is the time the read/write head of a hard disk takes to move from one location on the disk to another. Seek time affects how fast data on the disk can be accessed. As mentioned in "Hadoop: The Definitive Guide", the storage capacity of hard disks has grown roughly 1000-fold over 20 years, while data access speed has grown only about 20-fold. As a result, it takes more than 2 hours to read one terabyte of data from a single hard disk. This is a bottleneck when processing huge volumes of data.
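The read-time figure can be checked with simple arithmetic. The sketch below assumes a sustained transfer rate of about 100 MB/s and decimal units (1 TB = 10^12 bytes); both numbers are illustrative assumptions, not measurements.

```python
# Rough estimate of how long a single disk takes to read one terabyte.
# ASSUMPTION: ~100 MB/s sustained transfer rate, decimal units.
MB = 10**6
TB = 10**12


def read_time_hours(size_bytes, bytes_per_second):
    """Hours needed to read size_bytes sequentially at a constant rate."""
    return size_bytes / bytes_per_second / 3600


one_disk_hours = read_time_hours(1 * TB, 100 * MB)
print(f"{one_disk_hours:.2f} hours")  # about 2.78 hours
```

Seek time is ignored here, so a real disk doing many small, scattered reads would be slower still.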
To address this issue, Hadoop splits each file into parts (HDFS calls them blocks) and stores the parts on a number of disks. Suppose the default part size is set to 100 MB: a file containing 10 GB of data is then split into about 100 parts, which can be stored on 100 different disks. When an application wants to read the file, the parts can be read in parallel, so the access speed can be up to about 100 times that of reading the whole file from a single disk.
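The 10 GB example can be worked through in a few lines. This is only a sketch of the arithmetic, not how HDFS is implemented: the function names are hypothetical, the 100 MB/s per-disk rate is an assumption, and the speedup is an ideal upper bound that ignores network and coordination overhead.

```python
# Split a file into fixed-size parts and compare single-disk vs. parallel reads.
# ASSUMPTIONS: decimal units, 100 MB/s per disk, one part per disk.
MB = 10**6
GB = 10**9


def split_into_parts(file_size, part_size):
    """Return a list of (offset, length) pairs covering the whole file."""
    parts = []
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)  # last part may be shorter
        parts.append((offset, length))
        offset += length
    return parts


def parallel_read_seconds(parts, bytes_per_second):
    """With every part on its own disk, the largest part sets the read time."""
    return max(length for _, length in parts) / bytes_per_second


parts = split_into_parts(10 * GB, 100 * MB)
single_disk_seconds = 10 * GB / (100 * MB)
parallel_seconds = parallel_read_seconds(parts, 100 * MB)
speedup = single_disk_seconds / parallel_seconds
print(len(parts), speedup)  # 100 parts, ideal speedup of 100.0
```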
Issues with distributing the files:
When we split a file into n parts and store them on n disks, the risk of losing data to a disk failure increases roughly n-fold, because the failure of any one of those disks loses part of the file. Replication was introduced to prevent this.
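The n-fold figure is an approximation that holds when the chance of any single disk failing is small. A minimal sketch, assuming each disk fails independently with the same probability p (the value 0.001 is made up for illustration):

```python
# Probability that at least one of n disks fails.
# ASSUMPTION: independent failures with the same per-disk probability p.
def prob_any_failure(p, n):
    """1 minus the probability that all n disks survive."""
    return 1 - (1 - p) ** n


p = 0.001                            # hypothetical per-disk failure probability
one_disk = prob_any_failure(p, 1)
hundred_disks = prob_any_failure(p, 100)
ratio = hundred_disks / one_disk
print(f"{hundred_disks:.4f} vs {one_disk:.4f}, ratio {ratio:.1f}")
```

For p = 0.001 and n = 100 the ratio comes out a little under 100, and it drops further below n as p grows.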
Replication:
Each part of a file in Hadoop is stored on multiple disks to avoid losing data when a disk fails. This is called replication. The number of copies stored for each part is called the replication factor.
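The effect of the replication factor can be illustrated with a toy model. The round-robin placement below is a hypothetical scheme chosen for simplicity; it is not the placement policy HDFS actually uses, and the function names are made up for this sketch.

```python
# Toy model: place each part on several disks, then check which failures are survivable.
# ASSUMPTION: simple round-robin placement, not HDFS's real policy.
from itertools import combinations


def place_replicas(num_parts, num_disks, replication_factor):
    """Map each part number to the set of disks holding a copy of it."""
    if replication_factor > num_disks:
        raise ValueError("replication factor cannot exceed the number of disks")
    return {
        part: {(part + k) % num_disks for k in range(replication_factor)}
        for part in range(num_parts)
    }


def survives(placement, failed_disks):
    """True if every part still has at least one copy on a healthy disk."""
    failed = set(failed_disks)
    return all(disks - failed for disks in placement.values())


placement = place_replicas(num_parts=10, num_disks=5, replication_factor=3)

# With 3 copies of each part, any 2 disks can fail without losing data.
any_two_ok = all(survives(placement, pair) for pair in combinations(range(5), 2))

# Losing all 3 disks that hold part 0 does lose data.
three_ok = survives(placement, {0, 1, 2})
print(any_two_ok, three_ok)  # True False
```

In general, a part with replication factor r stays readable as long as fewer than r of the disks holding it fail at the same time.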