Satyajit Das (@satyajit)

    A Deep Dive into Hadoop, Part 1 :-
    ==================================

    What is Big Data?
    It is a huge volume of data that cannot be processed within a given time using traditional databases like RDBMS and conventional application programming. So Hadoop was introduced: it is a big data processing system.
    Hadoop is based on Google's papers on the Google File System (GFS) and MapReduce, and it is an open-source distributed storage and processing framework.

    RDBMS vs Hadoop Architecture :-
    ===============================
    RDBMS:-
    =======

    1) RDBMS follows a master-slave architecture: there is a main computer called the master, and other slave computers are attached to it; together they are called a cluster.
    2) It is used for smaller data sizes.
    3) We cannot store unlimited data in an RDBMS “cluster”, as there is a limit on the cluster size.
    4) RDBMS requires high-end hardware to handle the data, otherwise it crashes. So the hardware cost is high, and the software it uses is licensed, so the license must be renewed every year.
    5) RDBMS products support clusters of up to:
    a) MySQL - 128 machines.
    b) Oracle - 256 machines.
    c) Teradata/Vertica - 512 machines.
    6) RDBMS processes the data serially.
    7) If the data is small then RDBMS is good, but with very large data it crashes.

    Hadoop :-
    =========
    1) Hadoop is a distributed storage and processing system.
    2) Hadoop is free and it works on “commodity hardware”, which costs far less (roughly 20 times less) than server hardware.
    3) Hadoop supports virtually unlimited storage: there is no limit on the size of a Hadoop cluster, so we can join “n” machines together.
    4) In Hadoop, a file (say 10 GB of data) is split into multiple chunks (blocks), and the blocks are stored on different machines of the distributed system.
    By default the block size is 64 MB up to Hadoop 1.x, and 128 MB from Hadoop 2.x onwards.
    For example, take 10 GB of data on Hadoop 2.x, where each block is 128 MB. The total number of blocks is 10 GB / 128 MB = 80 blocks, spread across the machines of the cluster.
    5) Hadoop processes the data in parallel.

    Facts about Hadoop:-
    ===================
    1) Similar to the Google File System, we have HDFS (Hadoop Distributed File System).
    2) It is the primary data storage system used by Hadoop applications.
    3) It uses a “NameNode” and “DataNode” architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
    4) “Doug Cutting” is the creator of Hadoop, and “Apache” is the vendor of Hadoop.
    5) Based on this open source, several vendors provide enterprise versions, such as :-

    a) CDH - by Cloudera.
    b) HDP - by Hortonworks.
    c) BigInsights - by IBM.
    d) HDInsight - by Microsoft.
    e) EMR (Elastic MapReduce) - by Amazon.

    Hadoop Architecture :-
    =======================

    1) Hadoop is implemented in “JAVA”.
    2) The components of Hadoop, from the top of the stack to the bottom :-

    a) Hadoop frameworks (MapReduce, Pig, Hive, Flume, Oozie and HBase).
    b) Hadoop clusters (file handling systems).
    c) JVM (Java Virtual Machine).
    d) Any flavor of Linux, like CentOS, Ubuntu etc.
    e) Hardware.

    3) The Hadoop architecture is implemented as a master-slave architecture.
    The master always receives requests from the clients and assigns them to the slave machines.
    There are two masters: one is the “Master (Active)” and the other is the “Secondary Master (Passive)”. If the “Active” master goes down, the “Passive” one becomes active and takes the requests from the clients.

    What is a Daemon?
    In a multitasking computer, a daemon is a program that runs as a background process rather than under the direct control of an interactive user.
    Traditionally, the process name of a daemon ends with ‘d’, e.g. syslogd (the daemon that implements the system logging facility) and sshd (the daemon that serves incoming SSH connections).

    Hadoop architecture consists of 5 daemons.
    ==========================================

    1) Name Node
    2) Secondary Name Node
    3) Data Node

    For Hadoop 1.x version
    4) Job tracker
    5) Task tracker.

    From Hadoop 2.x version

    4) Resource manager
    5) Node manager.

    Note:- A system often starts daemons at boot time so that they can respond to network requests, hardware activity, etc.

    Hadoop Master-slave Architecture :-
    =================================
    The Hadoop master-slave architecture has the following components.

    Master Nodes:-
    =============
    1) Job Tracker (JT) or Resource Manager (RM). (Primary master.)
    2) Name Node (NN). (Secondary master.)
    3) Secondary Name Node (SNN). (Secondary master.)

    Slave Nodes:-
    =============
    4) Data Node (DN).
    5) Task Tracker (TT) or Node Manager (NM).

    What is the Name Node?
    The NameNode (NN) is responsible for creating the metadata of all the blocks and sending it to the clients; the client is then responsible for performing the “HDFS WRITE FLOW” on the “slave computers”.

    Hadoop has 2 core components which do different tasks :-
    ============================================================

    1) HDFS Architecture for Storage:- HDFS (Hadoop Distributed File System) for storage purposes.
    ==============================================================================================
    It consists of the NameNode (NN) as the master, which takes the client's storage requests and checks the metadata information with the Secondary Name Node (SNN), and
    the DataNodes (DN) as the slaves, which do the actual storage.

    2) Map Reduce Part :-
    =====================
    It consists of the JobTracker (JT) or ResourceManager (RM) as the master and the TaskTracker (TT) or NodeManager (NM) as the slaves.
    The JobTracker or ResourceManager takes the processing request from the client and sends it to the slaves.
    The slaves process the data in parallel and, after completion, report the results back to the JobTracker/ResourceManager.

    Note:- This is just like the system that exists in corporations. The manager assigns a task to the team lead, the team lead distributes the
    task among the employees, then the team lead collects the results and gives them back to the manager.

    Tips for running the nodes :-
    ==============================
    1) It is always recommended to run the Data Node and Task Tracker/Node Manager (slaves) on the same machine.
    2) The Secondary Name Node (SNN) is not a backup of the Name Node (NN), unlike the passive master in the architecture described earlier.
    3) In Hadoop 2.x, Apache introduced “High Availability”: we can run multiple Name Nodes (NN).

    HDFS File System (Hadoop Distributed File System) :-
    ===================================================
    A file system is a storage space in an operating system.

    Operating system file system:-
    =============================

    Block :- A small piece/chunk of a file.

    In our system, the block size (the smallest unit of a file) depends on the operating system.
    Each block is 4 KB in older operating systems and 8 KB in newer ones. The operating system's file system starts as a part of the OS process, so we cannot increase or decrease its size.

    HDFS File system:-
    ==================
    A block is a piece/chunk of a file to be processed.

    Unlike our OS file system, which starts as a part of the OS process and whose size we cannot increase or decrease, the
    HDFS file system is a user-space file system: it starts after the OS has started, and we can customize the HDFS file system's block size.

    HDFS file system block sizes are:
    =================================
    1) 64 MB until Hadoop 1.x (configurable, i.e. it can be increased/decreased as a multiple of 64 MB, i.e. 64 MB, 128 MB, ...).
    2) 128 MB from Hadoop 2.x onwards.

    In the Hadoop installation path, for example HADOOP_HOME/etc/hadoop, there is an hdfs-site.xml file in which we can define the block size; if we do not specify the size,
    the default size (64 MB or 128 MB, per the Hadoop version) is used.
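
    For illustration, a minimal sketch of an hdfs-site.xml entry that sets the block size (the value is in bytes; 134217728 bytes = 128 MB):

    <!-- hdfs-site.xml: sets the HDFS block size (value in bytes) -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB, the Hadoop 2.x default -->
      </property>
    </configuration>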

    What happens when we place a file into the Hadoop File System?
    We have configuration files like the following (minimal sketches of these files appear after the note below):
    1) core-site.xml :- It specifies the file system type (HDFS or local) etc.
    2) hdfs-site.xml :- It defines the block size.
    3) mapred-site.xml :- It specifies the MapReduce type, MapReduce1 or MapReduce2.

    Note :- Before Hadoop 2.x, we had the MR1 architecture. From Hadoop 2.x we have the MR2 architecture (also known as YARN).
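
    For illustration, minimal sketches of the other two files (the host name and port here are placeholders, not fixed values):

    <!-- core-site.xml: points clients at HDFS instead of the local file system -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode-host:9000</value> <!-- placeholder host and port -->
      </property>
    </configuration>

    <!-- mapred-site.xml: selects MR2 (YARN) as the MapReduce framework -->
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>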

    Replication and Replication Factor:-
    ===================================

    In this topic we are going to see the following:

    1) What is Replication?
    2) What is Replication factor?
    3) Why is a replication factor, or replication, required?
    4) Types of replication.

    1) What is Replication?
    =========================
    It is the process of saving duplicate copies of the same block on different racks in different data centres.
    Note:- By default the replication factor is “3”, but it is configurable.

    Meta-Data :-
    =============
    It contains the mapping between a file and its blocks, and between each block and its DataNode list, plus attributes like
    owner, file permissions, etc.
    It contains information such as each block's address, where it is stored and how many blocks there are.

    Note: While preparing the metadata, the NameNode (NN) uses the “proximity algorithm”.

    What is the proximity algorithm?
    ==============================
    In Hadoop, the network is represented as a tree. The distance between two nodes is the sum of their distances to their closest common ancestor.
    The proximity algorithm helps us find the closeness of nodes. In practice this “rack awareness” comes from an admin-supplied topology script (or Java class) that Hadoop invokes when it starts (a configuration sketch follows the list below).

    The proximity algorithm has information about :-

    1) Racks:- A rack is a collection of machines.
    2) Data centre:- A location or region, which contains a collection of racks.
    3) Bandwidth:- Information about each machine's networking, internet speed and data transfer rate.
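
    As a sketch, the topology script is registered in core-site.xml via the net.topology.script.file.name property (the script path here is a placeholder):

    <!-- core-site.xml: registers a rack-awareness (topology) script -->
    <configuration>
      <property>
        <name>net.topology.script.file.name</name>
        <value>/etc/hadoop/conf/topology.sh</value> <!-- placeholder path -->
      </property>
    </configuration>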

    2) What is the Replication Factor?
    ===============================
    The number of duplicate copies of each block stored on different machines is called the replication factor.

    a) The default replication factor is “3”, but it is configurable.
    b) There is a file called hdfs-site.xml; here we configure the replication factor (see the sketch below).
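
    A minimal sketch of that setting, as mentioned in point (b) above:

    <!-- hdfs-site.xml: sets the replication factor for new blocks -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- the default; configurable per durability needs -->
      </property>
    </configuration>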

    3) Why a Replication Factor? or Replication?
    ==========================================
    Comparing the robustness of the available hardware:

    PC H/W < Commodity H/W < Server H/W.

    As we have discussed, Hadoop works on “commodity hardware”, so the chance of failure is high compared to server H/W.
    To resolve this problem, we maintain multiple copies of the same block on different machines. The default replication factor is “3”.

    4) Types of Replication:-
    =========================

    Under-replicated blocks:-
    ==========================
    If any machine goes down, some replicas of the blocks on it are lost; blocks that have fewer replicas than the replication factor are called “under-replicated blocks”.
    -> The NameNode (NN) captures the under-replicated block information.
    -> For under-replicated blocks, the NameNode (NN) asks the other slave machines to replicate the block, to maintain the replication factor of “3”.
    -> It updates the metadata information and tries to restore the replication factor so that the data stays safe.

    Over Replications blocks:-
    ==========================
    After a machine breaks down, the NameNode (NN) tells the other DataNodes (DN) to restore the replication factor for the affected block copies.
    But once the broken-down system is restored, each block has its “3” copies plus the previously created duplicates; these extra copies are called over-replicated blocks, and they need to be removed.
    The “NameNode” sends a signal to the DataNodes to maintain the replication factor,
    and the machine that takes the request first removes the extra copy, restoring the replication factor.

    Generally this is not a problem, as the duplicates/extra copies are removed easily.

    Heart-Beat Signals:-
    ====================
    There are two types of HeartBeat Signals:

    1) HDFS Heartbeat signal:-
    ===========================
    a) It is the signal sent by the DataNodes (slaves) to the NameNode (master), reporting their state and the number of blocks present.
    b) By default, a DataNode sends a heartbeat every 3 seconds; if the NameNode receives no heartbeat for about 10 minutes, it marks that DataNode as dead. This interval is configurable in the hdfs-site.xml file (see the sketch below).
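
    A minimal sketch of the heartbeat interval setting mentioned in point (b):

    <!-- hdfs-site.xml: the DataNode heartbeat interval, in seconds -->
    <configuration>
      <property>
        <name>dfs.heartbeat.interval</name>
        <value>3</value> <!-- seconds; the default -->
      </property>
    </configuration>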

    2) Map-Reduce Heart-Beat Signal:-
    =================================
    Every Task Tracker (TT) sends heartbeat signals to its master, i.e. the Job Tracker (JT), reporting the state of its job executions.

    The Main Core Daemons :-
    =======================

    A daemon is a program that runs in the background of the Hadoop system.

    In Hadoop the Daemons are :-

    1) NameNode.
    2) Secondary NameNode.
    3) DataNode.

    Until Hadoop 1.x:
    4) Job Tracker.
    5) Task Tracker.

    From Hadoop 2.x:

    4) Resource Manager.
    5) Node Manager.

    1) NameNode(NN) :-
    =================

    1) It is the master daemon in HDFS; it receives the storage requests from clients and
    prepares the metadata.
    2) It monitors the slave machines.

    2) Secondary NameNode (SNN):-
    ==========================
    It is not a backup of the primary NameNode, unlike the passive master described earlier. It stores the metadata about the blocks.
    There are two main files :-
    a) Edit log files:- Whenever any operation modifies the HDFS system, an entry is stored in the edit log files. They contain information like changes to file permissions and changes to ownership.
    b) Fsimage:- It is present in the metadata. It contains all the mapping information.

    1) The Secondary Name Node merges these two files, i.e. the fsimage and the edit log files,
    into one file (a new fsimage), then updates the metadata with this file and removes the merged entries from the edit log files.

    2) The SNN merges the fsimage and edit log files every 1 hour; this process is called “checkpointing” (the 1-hour interval, again, is configurable; see the sketch below).

    3) Due to its vital role, the SNN requires the same configuration as the NameNode.
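
    A minimal sketch of the checkpoint interval setting mentioned in point 2 (the Hadoop 2.x property name; the value is in seconds):

    <!-- hdfs-site.xml: how often the SNN checkpoints fsimage + edit logs -->
    <configuration>
      <property>
        <name>dfs.namenode.checkpoint.period</name>
        <value>3600</value> <!-- seconds; 1 hour is the default -->
      </property>
    </configuration>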

    3) DataNode :-
    ==============
    a) It is a slave daemon in the HDFS system architecture.
    b) It stores the data.
    c) It sends heartbeat signals to the master node, reporting on the completion of data storage.
    d) We can run Data Nodes on multiple machines.

    4) JobTracker/Resource Manager :-
    ===============================
    a) It is the master daemon in the MapReduce architecture.
    b) It receives the processing requests from the clients.
    c) It communicates with the NameNode (NN) for the metadata information about the blocks, then assigns the tasks to the TT/NM for processing.
    d) The JT is the master in the MR1 architecture.
    e) The RM is the master in the MR2 architecture (also called YARN).
    f) It is recommended to run the JT/RM on a separate machine.
    g) It monitors each slave machine.

    TaskTracker or Node Manager :-
    ==============================

    a) It is a slave daemon machine.
    b) The responsibility of the slave is to execute the jobs or tasks assigned by its master.

    There are two types of communication happening while storing and processing the data :-
    a) Master-master communication.
    b) Master-slave communication.

    Some Important Suggestions :-
    ===========================
    1) It is highly recommended to use good, reliable hardware for the “NameNode”.
    2) It is recommended to start all the master “daemons” on separate machines.
    3) It is highly recommended to have the DataNode (DN) and TT/NM on the same machine.
    4) Admins should periodically take a backup of the fsimage.
    5) In the latest versions, the “Secondary Name Node” is also called the “Checkpointing Node”.
    6) All these daemons are Java classes.

    High Availability Concept:-
    ==========================
    The information in the Hadoop system is made highly available by using a shared-storage concept, a “shared centralized file system” such as NFS.

    “NFS” is a centralized storage area where we can save and update the transactions. When a NameNode performs any transaction, it saves the
    information at a common centralized place that is available to all the NameNodes, so the other NameNodes have access to this file as well.

    For example :-

    Our ATM servers have a common storage area, and at any place we are able to withdraw money with full consistency, as the information is available to all the ATMs in the distributed system.
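
    As a hedged sketch of how Hadoop 2.x wires this up (the nameservice id, NameNode ids and NFS mount path below are placeholders; a real HA setup needs further per-NameNode address properties):

    <!-- hdfs-site.xml: a minimal sketch of NameNode HA with NFS shared edits -->
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value> <!-- placeholder nameservice id -->
      </property>
      <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value> <!-- the active and standby NameNodes -->
      </property>
      <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>file:///mnt/nfs/hadoop-ha</value> <!-- placeholder NFS mount -->
      </property>
    </configuration>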
