What is HDFS (Hadoop Distributed File System)?


Dealing with large amounts of data is common in today's world. Every interaction with technology produces data, and multiplied across roughly 4.72 billion Internet users, the result is a mind-boggling volume!

To store and manage this ever-growing data, many organizations turn to Hadoop. It is a powerful platform whose storage component is the Hadoop Distributed File System (HDFS). HDFS is arguably Hadoop's most significant component, and it deserves a thorough explanation. So let's go through this blog post and learn every detail of the Hadoop Distributed File System.

What is Hadoop Distributed File System (HDFS)?

Maintaining large data on a single machine is a tedious task. Therefore, the data is broken into smaller parts and stored across different machines. To manage these many smaller pieces of data, Hadoop applications use HDFS as their primary data storage system. It's a distributed file system that allows applications to retrieve their data quickly.

HDFS is a big data component that allows you to manage massive amounts of structured and unstructured data. It also distributes the storage and processing of huge data sets across a cluster of low-cost computers, making management and storage budget-friendly. Some of the reasons you would want to use HDFS include (a minimal usage sketch follows the list):

  • It offers quick recovery from hardware problems. This is crucial because in a large HDFS cluster, individual servers fail regularly. HDFS is designed to detect such failures and recover from them on its own.
  • HDFS is designed for high data throughput, making it ideal for streaming data access.
  • HDFS scales to hundreds of nodes in a single cluster, providing high aggregate data capacity for applications with gigabytes to terabytes of data.
  • HDFS is portable across hardware platforms and works with many common operating systems.
  • HDFS is cost-effective since it runs on commodity hardware: low-cost, off-the-shelf machines rather than specialized servers.
  • HDFS is an extremely reliable file system that keeps multiple copies of data on different machines to ensure that it's constantly available.
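
To make this concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode-host:8020) and the file path are illustrative placeholders, not values from this post:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; point this at your own NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client streams data; HDFS splits it into blocks
            // and replicates them across DataNodes behind the scenes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS!");
            }

            // Read: the NameNode supplies block locations; the data itself
            // is fetched directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```

Note that the client never needs to know where the blocks physically live; the FileSystem abstraction takes care of that.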

Salient Features of HDFS

1. Distributed file system

HDFS is a distributed file system (or distributed storage) that runs on commodity hardware and can manage massive amounts of data. With HDFS, you can extend a cluster to hundreds or even thousands of nodes.

2. Blocks

HDFS is designed to handle very large files. It divides these big files into smaller pieces called blocks. Each file is stored as a sequence of blocks, and each block holds a fixed maximum amount of data that can be read or written. The block size is 128 MB by default (but you can change that depending on your requirements). Hadoop splits files into blocks and distributes them across different nodes.
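
As a sketch, the block size can be chosen per file at creation time via one of the overloads of FileSystem.create. The path and sizes below are illustrative, and `fs` is an open FileSystem handle as in the earlier sketch:

```java
// Assumes an open FileSystem handle `fs`, as in the earlier sketch.
Path file = new Path("/user/demo/big-file.bin");

long blockSize = 64L * 1024 * 1024;  // 64 MB instead of the 128 MB default
short replication = 3;               // keep the default replication factor
int bufferSize = 4096;               // client-side write buffer size

try (FSDataOutputStream out =
         fs.create(file, true, bufferSize, replication, blockSize)) {
    out.write(new byte[1024]);       // HDFS splits the stream into 64 MB blocks
}
```

Alternatively, the cluster-wide default can be changed with the dfs.blocksize configuration property.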

3. Replication

HDFS data can be replicated from one HDFS server to another. Data blocks are duplicated to provide fault tolerance, and applications can specify the number of replicas of a file. The replication factor can be set at the time of file creation and adjusted later. HDFS files are write-once and support only a single writer at a time. Each DataNode in the cluster periodically sends a 'Heartbeat' and a Blockreport to the NameNode, which makes all decisions regarding block replication.
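
As a sketch of the client-side API (the path is illustrative, and `fs` is the handle from the earlier sketches), the replication factor of an existing file can be adjusted with FileSystem.setReplication:

```java
// Assumes an open FileSystem handle `fs`, as in the earlier sketches.
Path file = new Path("/user/demo/hello.txt");

// Raise the replication factor of this file to 4. The NameNode will
// notice the under-replication and schedule additional copies.
fs.setReplication(file, (short) 4);

// Read the current replication factor back from the file's metadata.
short replicas = fs.getFileStatus(file).getReplication();
System.out.println("Replication factor: " + replicas);
```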

4. Data reliability

HDFS provides fault tolerance by storing replicas of each data block on multiple nodes in the cluster. If a node goes down, users can still access the data from other nodes in the HDFS cluster that hold a copy of the same blocks. By default, HDFS replicates each block three times.

5. Fault-tolerant

Data is replicated and maintained across numerous racks and nodes. This means that when one machine in a cluster fails, another node can quickly serve a replica of the same data.

6. Scalable

Resources can be scaled based on the size of the file system. HDFS supports both vertical and horizontal scaling.

HDFS Architecture: Components of HDFS

1. HDFS Blocks

HDFS divides a file into small pieces and stores them on distinct machines in the cluster. Users of HDFS, however, are completely unaware of this; to them, it appears as if all of the data is stored on a single system. In HDFS, these smaller pieces are called blocks. The advantages of storing data in blocks rather than as whole files are listed below, followed by a short sketch that inspects a file's blocks:

  • A file may be far too large to fit on a single disk. It is therefore a good idea to distribute it across the cluster's machines.
  • By exploiting parallelism, the system achieves good workload distribution and prevents a single machine from becoming a bottleneck.
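
As an illustrative sketch (reusing the `fs` handle and an example file path from earlier), a client can ask where a file's blocks are stored via FileSystem.getFileBlockLocations:

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

// Assumes an open FileSystem handle `fs`, as in the earlier sketches.
Path file = new Path("/user/demo/big-file.bin");
FileStatus status = fs.getFileStatus(file);

// Ask the NameNode which blocks make up the file and which hosts hold them.
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    System.out.printf("offset=%d length=%d hosts=%s%n",
            block.getOffset(), block.getLength(),
            String.join(", ", block.getHosts()));
}
```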

2. NameNode

The HDFS cluster is built on a master-worker architecture, which means there is one master node and numerous worker nodes. The master node in a cluster is called the NameNode, and it runs on a separate, dedicated machine.

  • It manages the filesystem namespace, which is the tree or hierarchy of files and directories on the filesystem.
  • For all files, it stores information such as file owners, file permissions, and so on.
  • It also keeps track of where all of a file's blocks are located and how big they are.

All of this metadata is saved in two files, FsImage and EditLog, which are persisted on the NameNode's local disk.

  • FsImage stores the information about all the files and directories in the file system. For files, it records the replication factor, modification times, access permissions, the blocks that make up each file, and their sizes. For directories, it records the modification time and permissions.
  • The EditLog, on the other hand, records every write operation performed by clients. These changes are also applied to the in-memory metadata, which serves read requests.
  • Whenever a client wants to write to or read from HDFS, it first contacts the NameNode. The NameNode responds with the relevant block locations and metadata, and the client then completes the operation directly with the DataNodes. A short sketch of querying this metadata follows the list.
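
As an illustrative sketch (path and `fs` handle as before), every attribute printed below is answered from the NameNode's in-memory namespace, without contacting any DataNode:

```java
// Assumes an open FileSystem handle `fs`, as in the earlier sketches.
FileStatus st = fs.getFileStatus(new Path("/user/demo/hello.txt"));

// All of these attributes come from the NameNode's namespace metadata.
System.out.println("owner:       " + st.getOwner());
System.out.println("group:       " + st.getGroup());
System.out.println("permissions: " + st.getPermission());
System.out.println("length:      " + st.getLen());
System.out.println("block size:  " + st.getBlockSize());
System.out.println("replicas:    " + st.getReplication());
System.out.println("modified:    " + st.getModificationTime());
```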

3. DataNodes

DataNodes are worker (traditionally called 'slave') daemons that store the data blocks the NameNode assigns to them. As previously stated, the default configuration ensures that each data block is replicated. You can adjust the number of copies; however, fewer than three is not recommended. Hadoop's rack awareness policy states that replicas must be placed in the following manner (a sketch that inspects replica placement appears after the next list):

  • The total number of replicas of a block must exceed the number of racks it is stored on. (Racks are the physical groupings of nodes in a Hadoop cluster.)
  • A DataNode can store at most one replica of any given data block.
  • A single rack can hold at most two replicas of a data block.

If you follow these recommendations, you will be able to:

  • Increase the amount of bandwidth available on the network.
  • Protect your data from being lost.
  • Boost performance and dependability.
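
As an illustrative sketch (file path and `fs` handle as before), the rack placement of each block's replicas can be inspected through BlockLocation.getTopologyPaths; on a rack-aware cluster, the paths for one block should span more than one rack:

```java
// Assumes an open FileSystem handle `fs`, as in the earlier sketches,
// inside a method that declares `throws IOException`.
FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.bin"));
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

for (BlockLocation block : blocks) {
    System.out.println("block at offset " + block.getOffset() + ":");
    // Each topology path has the form /rack/host:port.
    for (String path : block.getTopologyPaths()) {
        System.out.println("  replica on " + path);
    }
}
```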

Besides these two types of nodes, a cluster additionally contains a node known as the Secondary NameNode. Let's have a look at what that means.

4. Secondary NameNode in HDFS

Suppose the NameNode has to be restarted, which may be necessary after a failure. On startup, it copies the FsImage from disk into memory and then, to catch up on all the events recorded since, replays the most recent EditLog on top of that FsImage. However, if the node has been running for a long time, the EditLog may have grown very large.

This means that applying the transactions from the EditLog would take a long time, and the file system would be unavailable during that period. To fix this, we use the Secondary NameNode: a dedicated node in the cluster whose major responsibility is to merge the EditLog into the FsImage regularly and create checkpoints of the primary's in-memory file system metadata. This process is called checkpointing.

The checkpointing operation, however, is computationally demanding and memory-intensive, which is why the Secondary NameNode runs on a separate cluster node. Despite its name, the Secondary NameNode does not function as a standby NameNode. It exists only for checkpointing and for storing copies of the most recent FsImage.
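
To build intuition, here is a toy model of the checkpointing idea: a log of pending edits is periodically folded into a snapshot of the namespace. This is a deliberately simplified sketch, not Hadoop's actual implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of checkpointing: NOT Hadoop's real code, just the idea.
public class CheckpointToy {
    // "FsImage": a snapshot of the namespace (path -> metadata).
    private final Map<String, String> fsImage = new HashMap<>();
    // "EditLog": write operations recorded since the last checkpoint.
    private final List<String[]> editLog = new ArrayList<>();

    void recordEdit(String path, String metadata) {
        editLog.add(new String[] {path, metadata}); // cheap append on every write
    }

    // What the Secondary NameNode does periodically: replay the pending
    // edits into the image, then truncate the log so restarts stay fast.
    void checkpoint() {
        for (String[] edit : editLog) {
            fsImage.put(edit[0], edit[1]);
        }
        editLog.clear();
    }

    public static void main(String[] args) {
        CheckpointToy nn = new CheckpointToy();
        nn.recordEdit("/user/demo/a.txt", "replicas=3");
        nn.recordEdit("/user/demo/b.txt", "replicas=2");
        nn.checkpoint(); // after this, a restart only needs the small image
        System.out.println(nn.fsImage);
    }
}
```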

Conclusion

Hadoop is a platform for dealing with big data, and the Hadoop Distributed File System (HDFS) is its storage component. HDFS is Hadoop's most important component: it divides data into smaller blocks and stores them across many machines.

You should have a better grasp of HDFS and its function in the Apache Hadoop ecosystem after reading this blog. If you’re working with massive quantities of data or anticipate doing so in the future, Hadoop and HDFS can be of great help.
