Satyajit Das

@satyajit

  • Satyajit Das posted an update in the group Databases, 3 weeks, 3 days ago

    A Deep Dive into Hadoop, Part 2 :-
    ==================================

    Hadoop Installations :-
    =====================
    Hadoop is implemented in Java,
    so Java must be installed before installing Hadoop.

    There are 3 installation modes :
    1) Local/Standalone mode.
    2) Pseudo-distributed mode.
    3) Fully distributed mode.

    Local/Standalone Mode :-
    ======================
    a) All the daemons (NameNode (NN), SecondaryNameNode (SNN), JobTracker (JT)) run in a single JVM.
    b) It uses the local file system, not HDFS.
    c) It is used for development/debugging.

    Pseudo-Distributed Mode :-
    ========================
    a) Each daemon runs in a separate JVM.
    b) Each daemon behaves as if it were running on a separate machine.
    c) It uses the HDFS file system.
    d) It is used for both development and testing purposes.

    Fully Distributed Mode :-
    =========================
    a) All the daemons run in different JVMs, and each JVM runs on a different machine.
    b) Some daemons run on multiple machines.

    Note:- “Standalone mode” is the default mode of operation of Hadoop and it runs on a single node (a node is your machine).

    HDFS-WRITE-FLOW and HDFS-READ-FLOW :-
    ====================================

    HDFS Write Flow :-
    ===============
    In an HDFS write, the client (user) first contacts the NameNode (the master), and the NameNode returns the block information from its metadata.
    Once the client has the information about all the blocks, it performs the write operation.

    Before the write operation starts, the HDFS client sets up two separate queues, described below :-

    1) Data Queue :-
    ==============
    It holds the data to be written. The client writes the data block by block, and for each block a pipeline of DataNodes is created.

    2) Acknowledgement Queue :-
    =========================
    It holds the packets waiting for acknowledgement. Once the data is written to the blocks, an acknowledgement signal is sent back to confirm that the write operation completed successfully.

    Note :- For every block a separate Pipeline (workflow) is created.

    HDFS Read Flow :-
    =================
    When the client wants to read blocks, it sends a request to the NameNode. The NameNode looks up the block locations in its metadata and returns them to the client, and
    the client then reads each block from the nearest machine where the block is stored (the nearest machine is the one with the lowest latency, i.e. the fastest response).
    HDFS uses a proximity algorithm, i.e. it picks the nearest node, which depends on the bandwidth and speed of each DataNode. A minimal client-side sketch of both flows follows below.
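
    As a rough illustration of the client side of these two flows, here is a minimal sketch using the standard HDFS Java API. The NameNode address and the file path are made-up placeholders, not values from this article:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt"); // made-up path

            // Write flow: the client asks the NameNode for blocks, then streams
            // data through the DataNode pipeline behind this output stream.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hadoop is a bigdata framework");
            }

            // Read flow: the client gets block locations from the NameNode and
            // reads each block from the nearest DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }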

    There are three installation stages :-
    ====================================
    1) Pre-installation steps :-
    =========================
    a) Linux installation.
    b) Java installation; set JAVA_HOME and PATH.

    2) Installation steps :-
    =====================
    a) Download Hadoop.
    b) Extract Hadoop.
    c) Set HADOOP_HOME and PATH.

    3) Post-installation steps :-
    ==========================
    a) Configure Hadoop.
    b) Create passwordless SSH.
    c) Start and stop the Hadoop daemons.

    # Hadoop configuration consists of the following files :-
    ===========================================================
    a) core-site.xml :- Here we provide the file system type and the NameNode URL.
    b) hdfs-site.xml :- Here we provide the HDFS block size (otherwise it takes the default block size, i.e. 128 MB).
    c) mapred-site.xml :- In this file we specify which type of MapReduce architecture we are using (discussed below),
    i.e. whether it is MR1 or MR2 (the YARN architecture).
    d) yarn-site.xml :- Here we provide the YARN (ResourceManager) details. (A rough sketch of the equivalent property keys follows after this list.)
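
    As a rough sketch of what these four files control, the same settings can also be expressed through Hadoop's Configuration API; the host name and values below are placeholder assumptions, not required values:

    import org.apache.hadoop.conf.Configuration;

    public class ConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // core-site.xml: file system type and NameNode URL (placeholder host)
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            // hdfs-site.xml: block size (default is 128 MB)
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
            // mapred-site.xml: which MapReduce architecture to use (MR2/YARN here)
            conf.set("mapreduce.framework.name", "yarn");
            // yarn-site.xml: ResourceManager details (placeholder host)
            conf.set("yarn.resourcemanager.hostname", "namenode-host");

            System.out.println(conf.get("fs.defaultFS"));
        }
    }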

    Note:- It is important to set up passwordless SSH, because the various daemons (and the start/stop scripts) communicate with other machines, and without it they will keep asking for
    passwords to communicate with the other daemons.

    To start or stop daemons :-
    =========================
    HADOOP_HOME/sbin contains all the scripts.

    1) start-all.sh :- To start all the daemons.
    2) stop-all.sh :- To stop all the daemons.

    start-dfs.sh will start the HDFS daemons below :-

    a) NameNode.
    b) SecondaryNameNode.
    c) DataNode.

    3) start-yarn.sh :- It will start the daemons below.

    a) ResourceManager.
    b) NodeManager.

    In a plain Apache Hadoop installation we need to install all the software manually :

    1) Hadoop itself.
    2) The frameworks of the Hadoop ecosystem :
    a) Pig.
    b) Hive.
    c) Sqoop.
    d) Flume.
    e) Oozie.
    f) HBase.

    The Default File Permissions in HDFS :-
    ======================================
    -rw-r--r-- (644)
    owner/user : rw- = 6 (4+2+0)
    group      : r-- = 4 (4+0+0)
    others     : r-- = 4 (4+0+0)

    Default Permissions for a Directory :-
    ======================================

    drwxr-xr-x (755)
    owner/user : rwx = 7 (4+2+1)
    group      : r-x = 5 (4+0+1)
    others     : r-x = 5 (4+0+1)

    'r' = Read (4)
    'w' = Write (2)
    'x' = Execute (1)
    (A small sketch of setting such permissions through the HDFS Java API follows below.)
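
    For illustration only, a minimal sketch of applying such permissions through the HDFS Java API; the directory path is a made-up placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class PermissionSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/user/demo"); // hypothetical directory

            // 755 = rwx for the owner, r-x for group and others
            FsPermission dirPerm =
                    new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.READ_EXECUTE);
            fs.setPermission(dir, dirPerm);

            System.out.println(dirPerm); // prints rwxr-xr-x
        }
    }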

    Mapreduce :-
    ============
    MapReduce is a processing engine with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
    It is an algorithm that has two phases or parts.

    Mapreduce has two phases :-
    1) Mapper phase :-
    ==================
    The mapper processes the input files from HDFS and puts the intermediate results on the local storage of a separate machine.

    2) Reducer phase :-
    ==================
    The reducer processes the intermediate files produced by the mappers, aggregates the intermediate results, and then writes the final output to the HDFS system.

    Note :- Hadoop is a bigdata framework, Mapreduce is a processing engine, and HDFS is the Hadoop Distributed File System.

    In Mapreduce :-
    ===============
    In MapReduce we write logic to split the data into multiple words based on a delimiter (for example ',' or a space), and then we emit each word as a (KEY, VALUE) pair.

    The output of this MapReduce job is the count of how many times each word is repeated.

    Example1:- Hadoop is a bigdata framework and Mapreduce is a processing engine.

    Final O/P of Mapreduce is :-
    =====================

    a 2
    and 1
    bigdata 1
    engine 1
    framework 1
    Hadoop 1
    is 2
    Mapreduce 1
    processing 1

    Mapper phase :-
    ===============

    In the mapper phase, the input is split into multiple words based on the delimiter, and based on the requirement we prepare the (key, value) pairs; a minimal mapper sketch follows after these examples.

    Eg: (word, count)
    or (key, value)
    or (word, number)
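
    A minimal word-count mapper sketch for the example sentence above (the class and variable names are my own illustrative choices):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every word in the input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line into words on whitespace (the delimiter could also be ',').
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }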

    In Reducer :-
    =============

    The reducer side performs 3 operations :

    1) Grouping phase.
    2) Sorting phase (based on the alphabetical order of the keys).
    3) Aggregation phase.

    Note :- The first two operations (grouping and sorting) are performed automatically by the Hadoop framework internally; we only write the "Aggregation Logic".

    1) Grouping Phase :-
    ====================

    Note : It prepares a map object of key -> value list, i.e. for each key it creates a list of values :
    Hadoop     (1)
    is         (1, 1)
    a          (1, 1)
    bigdata    (1)
    framework  (1)
    Mapreduce  (1)
    engine     (1)
    processing (1)
    and        (1)

    2) Sorting Phase :-
    ==================

    It sorts the keys alphabetically.
    a
    and
    bigdata
    engine
    framework
    Hadoop
    is
    Mapreduce
    processing

    3) Aggregation Logic :-
    =====================
    We read each (key, value-list) pair and, for each key, we compute the sum of the values; a minimal reducer sketch follows below.

    a 2
    and 1
    bigdata 1
    engine 1
    framework 1
    Hadoop 1
    is 2
    Mapreduce 1
    processing 1

    So, after the aggregation logic runs, the data is stored in the HDFS system in the above format.
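
    Since the grouping and sorting are done by the framework, the code we actually write is the aggregation logic. A minimal reducer sketch, pairing with the illustrative WordCountMapper above:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives (word, [1, 1, ...]) after grouping/sorting and sums the values.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();          // aggregation logic: sum of the counts
            }
            result.set(sum);
            context.write(key, result);      // e.g. (is, 2), (Hadoop, 1), ...
        }
    }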

    Note :- In Map-Reduce applications we write 3 programs.

    1) User defined Mappers.
    2) User defined Reducers.
    3) Driver programs.

    Driver Program :-
    ==================

    1) The driver program is the main program: it contains the main() method.
    2) Execution starts from this program.
    3) In this program we specify the following things :
    a) Which is the mapper class.
    b) Which is the reducer class.
    c) Where the input path is located.
    d) Where the output path is located.
    e) What the output key and output value types are.

    So, our main focus is always on developing the MapReduce application, and for this we can use Java or Python. A minimal driver sketch follows below.
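
    A minimal driver sketch tying together the illustrative WordCountMapper and WordCountReducer above; the input and output paths come from the command line, and the class names are my own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = input path, args[1] = output path (supplied by the user)
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);    // which is the mapper class
            job.setReducerClass(WordCountReducer.class);  // which is the reducer class

            job.setOutputKeyClass(Text.class);            // output key type
            job.setOutputValueClass(IntWritable.class);   // output value type

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }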

    Hadoop Datatypes :-
    ===================
    For almost every primitive datatype in Java there is a corresponding Hadoop datatype.

    JAVA        HADOOP
    byte        ByteWritable
    short       ShortWritable
    int         IntWritable
    long        LongWritable
    float       FloatWritable
    double      DoubleWritable
    char        (no separate Writable; Text is generally used)
    boolean     BooleanWritable
    String      Text

    Why different data types for Hadoop?
    To transfer data from one machine to another, Hadoop uses its own serialization (the Writable types), which speeds up the data-transfer rate. A small sketch follows below.
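
    A small sketch of how a Writable type serializes itself into a compact binary form; the values are just the word-count pair from the example above:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableSketch {
        public static void main(String[] args) throws Exception {
            // Hadoop's Writable types serialize themselves to a compact binary form.
            Text word = new Text("hadoop");
            IntWritable count = new IntWritable(2);

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            word.write(out);   // Writable.write(DataOutput)
            count.write(out);
            out.flush();

            System.out.println("Serialized size in bytes: " + bytes.size());
            System.out.println(word + " -> " + count.get());
        }
    }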

    ==========
    Pig :-
    ==========
    Some important points about Pig :

    a) Pig is a dataflow tool; it has its own language called "Pig Latin".
    b) Pig was implemented by Yahoo.
    c) Pig syntax is similar to SQL.
    d) Pig statements are internally converted into "Map-Reduce" applications.
    e) Pig has its own shell called "grunt" (a grunt is the noise a pig makes, hence the name of the shell).
    f) The grunt shell only understands Pig Latin instructions; before Spark, Pig was widely used for this kind of processing.
    "Pig" has two modes :
    1) Local mode.
    2) HDFS (MapReduce) mode.

    In local mode, Pig takes a local file as input and stores the output on the local file system.

    In HDFS mode, Pig takes the input file from HDFS and writes/stores the results into the HDFS filesystem.

    How to open the Pig shell in local mode ?
    # pig -x local

    To load a file :-
    relation = LOAD 'filename' AS (column_name : datatype);

    Note : Here the relation is like a variable, and it is used to hold the loaded file; LOAD is the command that reads it.

    To execute relations we have 2 commands.

    # dump  = executes the relation and displays the o/p on the console.
    # store = executes the relation and stores the o/p in some other location.
    Until we dump or store, Pig will not execute any statements; it only prepares the plan of execution.

    1) Load the file into R1.
    2) R2 = perform act1 on R1.
    3) R3 = act2 on R2.
    4) R4 = act3 on R3.
    5) Dump/store the relation R4.

    Note1 :- Every statement depends on the previous statement, so Pig Latin is called a "DATA-FLOW" language, as the data flows from one relation to the next.

    Note2 :- Whenever we use the dump or store command, PIG converts all the statements into MapReduce jobs.

    Note3 :- The o/p of Pig is a "Tuple".

    Analogy of terminology :-
    ========================
    RDBMS          PIG terminology
    Table name     Relation
    Table          Bag
    Row            Tuple
    Column         Field

    Tuple : a collection of fields is called a tuple, and a collection of tuples is called a "Bag".

    Bag = ( collection ( Tuple = collection (fields) ) ).

    Example: ( (m, { (1, Satyajit, 10000, m), (2, Murugan, 20000, m), (3, Kishore, 30000, m) }), (f, { (4, Ramu, 30000, m), (5, Ravi, 400000, f) }) ). (A small sketch of this data model in Pig's Java API follows below.)
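
    For illustration only, the same tuple/bag structure can be built with Pig's Java data API; the field values are the made-up ones from the example above, and this is a sketch of the data model, not how Pig Latin scripts are normally written:

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class PigDataModelSketch {
        public static void main(String[] args) throws Exception {
            TupleFactory tf = TupleFactory.getInstance();
            BagFactory bf = BagFactory.getInstance();

            // A tuple is a collection of fields.
            Tuple row = tf.newTuple();
            row.append(1);
            row.append("Satyajit");
            row.append(10000);
            row.append("m");

            // A bag is a collection of tuples.
            DataBag bag = bf.newDefaultBag();
            bag.add(row);

            System.out.println(bag);   // e.g. {(1,Satyajit,10000,m)}
        }
    }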

    Note :-
    ========
    Flume :- It is also a data ingestion tool, though it is rarely used nowadays.
    OOZIE :- Used for defining workflows and scheduling tasks.

    =============
    HIVE :-
    =============

    1) Hive is a data warehouse environment on top of the HDFS file system.
    2) Hive has its own language called "HQL", the "Hive Query Language".
    3) Hive was implemented by "Facebook", and its syntax is very similar to "MySQL".
    4) Later "Facebook" donated Hive to Apache to make it available to everyone.

    5) Hive has 2 components :-
    1) The Hive Metastore.
    2) The Hive Warehouse.

    6) Hive Metastore :- It is an RDBMS; the default Hive metastore is the "Derby" database.

    7) We can configure the metastore as MySQL, Oracle or Postgres. It contains the metadata (schema) of the tables.

    8) Hive Warehouse :- It is an HDFS filesystem location where the actual data resides.

    ===============================
    Apache Hadoop commands for HIVE :-
    ================================
    1) start-dfs.sh // used to start the HDFS daemons.
    2) start-yarn.sh // used to start YARN.
    3) jps // this command shows all the running Hadoop processes (JVMs).
    4) clear // clears all the data on the screen.
    5) stop-yarn.sh // this command stops all the YARN processes, i.e. the NodeManagers and the ResourceManager.
    6) nano ~/.bashrc // this command opens the ~/.bashrc file, where you can set environment variables such as JAVA_HOME, HADOOP_HOME and PATH, and other properties.
    7) exit // used to come out of the command window.
    8) hive --version // shows the Hive version. (When we create data in Hive, by default it is stored under /user/hive/warehouse.)
    9) show databases; // lists all the databases (directories) in the Hive warehouse. Databases are created in the default location of Hive, i.e. /user/hive/warehouse.
    10) create database satya; // creates a database named "satya".
    11) use satya; // with the use command we connect to a database.
    12) show tables; // used to check whether a table exists in the database or not.
    13) describe table_name; // shows all the columns (the schema) of the table.
    14) show functions; // shows all the HIVE functions and operators.
    15) load data local inpath '/var/www/dataset/directory/filename.csv' into table tbl_first; // loads a local file into the table.
    16) select * from tbl_first limit 10; // shows the top 10 rows of the table.
    17) select name from tbl_first order by name desc limit 20; // first the name column is sorted in descending (reverse alphabetical) order, then the first 20 names are returned.
    18) select * from tbl_first order by name desc limit 20;

    Hive Table Types :-
    ===================
    Hive tables can be classified in three ways :-

    1) Inner/Managed tables and External tables.
    2) Partitioned and Non-partitioned tables.
    3) Bucketed and Non-bucketed tables.

    1) Managed or Inner Tables :-
    =============================

    Hive > create table test ( line string ) ; // creates the schema of the table named "test".
    Hive > load data local inpath 'temp' into table test ; // this command loads the data from the local file "temp" into the table.

    Note :- By default a table is a Managed or Inner table.

    2) External Tables :-
    ====================
    Hive > create external table extest ( line string ) ; // this creates the schema of the external table.
    Hive > load data local inpath 'temp' into table extest ; // this loads the data from the local file "temp" into the table "extest".
    Hive > create database mydb ; // used to create a database "mydb".

    Note : By default a table is created as "Managed". If you drop a managed table, it drops both the data present in the warehouse and the schema present in the metastore (RDBMS).
    But if you drop an external table, it drops only the schema from the metastore, not the data present in the warehouse on the HDFS system.

    The main advantage of an external table is that the data remains in the HDFS file system, so even if the table is dropped we can still reuse the data present in HDFS.
    It is also good practice to create tables as external.

    Groupby :-
    ===========
    It is used to separate the results into groups; internally the HQL is converted into MapReduce jobs that form the groups.

    Note : After grouping the data we apply the aggregation operations.

    Array : a collection of elements of the same datatype.
    Map : a collection of (key, value) pairs, just like a dictionary in Python.
    Struct : a collection of elements of different datatypes.

    OrderBy :- Orderby is the clause we use with the "SELECT" statement in Hive queries to sort the data.

    Aggregations Operations :-
    1) MIN
    2) MAX
    3) COUNT
    4) AVG
    5) SUM

    Normalization is used mainly for two purposes :-

    1) Eliminating redundant (useless and repeated) data.
    2) Making sure that the data is stored logically.

    Hive also has the same data anomalies, i.e. insertion, deletion and update anomalies.

    So, to overcome these anomalies we have the normalization rules.

    1) First Normal Form.
    2) Second Normal Form
    3) Third Normal Form
    4) BCNF
    5) Fourth Normal form.

    1) First Normal Form (1NF) :-
    ===========================
    For a table to be in the first normal form, it should follow these 4 rules :-

    1) It should only have single (atomic) valued attributes/columns.
    2) Values stored in a column should be of the same domain/datatype.
    3) All the columns in a table should have unique names.
    4) The order in which the data is stored does not matter.

    2) Second Normal Form (2NF):-
    ===========================

    For a table to be in the second Normal form :-

    1) It should be in the First Normal form.
    2) And it should not have any partial dependency.

    3) Third Normal Form ( 3NF) :-
    ===========================
    A table is said to be in the Third normal form when:-

    1) It is in the second normal form.
    2) And it does not have any transitive dependency.

    4) Boyce-Codd Normal Form (BCNF) :-
    ==============================
    BCNF or Boyce-Codd normal form is a stricter version of the third normal form. This form deals with a certain type of anomaly that is not handled by 3NF.
    A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, the following conditions must be satisfied :
    a) It must be in third normal form.
    b) For each functional dependency (X -> Y), X should be a super key.

    5) Fourth Normal Form (4NF) :-
    =============================
    A table is said to be in the fourth normal form when :-

    1) It is in the Boyce-Codd normal form.
    2) And it does not have any multi-valued dependency.

    So, normalization is about removing duplicates and redundancy from tables.

    Please feel free to comment if there is any improvement.

    Thanks
