Hey everyone, this is Reshma from Edureka and in today’s tutorial we’re going to focus on Hadoop. Thank you all
the attendees for joining today’s session. I hope that you’ll all enjoy
this session but before I begin I want to make sure that you all can hear me
properly so kindly drop me a confirmation on the chat window so that
I can get started. Alright, so I've got confirmations from Kanika, Neha, Keshav,
there's Jason, Sebastian. Okay, so we'll start by looking at the
topics that we'll be learning today. We'll start by learning the big data
growth drivers, the reasons because of which data has been turning into big
data. Then we'll take a look at what big data is, and we'll take a look at the
solution of big data which is Hadoop. So we’ll also see the master/slave
architecture of Hadoop and the different hadoop core components, we’ll also study
how HDFS stores data into data blocks and how the read/write mechanism works in HDFS. Then we’ll understand the programming part of Hadoop which is
known as MapReduce and we’ll understand this with a MapReduce program we’ll
understand the entire MapReduce job workflow and we’ll see the Hadoop
ecosystem, the different tools that the Hadoop ecosystem comprises, and
finally we'll take up a use case where we'll see how Hadoop has solved
big data problems in real life. So I hope that the agenda is clear to everyone. All
right then it seems that everyone is clear with the agenda so we’ll get
started, and we'll begin with the big data growth drivers. Now, the reasons
behind the growth of big data are numerous. Ever since the advancement of
technology, data has also been growing every day. If you go back in time,
like to the 70s or 80s, not many people were using computers; only a
fraction of people were dealing with computers, and that's why the data fed
into computer systems was also quite small. But now everyone owns a gadget:
everyone has a mobile phone, everyone owns a laptop, and they are generating
data from them every day. You can also think of the Internet of Things as a factor.
Nowadays we are dealing with smart devices; we have smart appliances that
are interconnected, and they form a network of things, which is nothing but the Internet of Things. So these smart appliances are
also generating data when they’re trying to communicate with each other and one
prominent factor behind the rise of big data that comes to our mind is social
media. We have billions of people on social media because we humans are
social animals: we love to interact, we love to share our thoughts and
feelings, and social media websites provide us just the platform that we
need, and we have been using them extensively every day. So if you look at
the stats in front of your screen, you can see that on Facebook, users
generate almost 4 million likes every 60 seconds. Similarly, on Twitter there are
almost 300 thousand tweets every 60 seconds, on Reddit 18,000 users
cast votes, on Instagram there are more than 1 million likes, and on YouTube
almost 300 hours of new video are uploaded every 60 seconds. Now, this is
data for every 60 seconds; you can imagine the kind of data that we are
dealing with every day, and how much data we have accumulated over the years
ever since social media websites started. Now that's a lot of data, and it
has been rising exponentially over the years. So let's see what Cisco has to say
about this. Now, you all know that Cisco is one of the biggest networking
companies, and they have monitored the data traffic they have been getting over
the years and published it in their white paper, which comes out
every year. From the stats they have provided, we can see that by
2020 we'll be dealing with 30.6 exabytes of data. Now, one exabyte is 10 raised to
the power 18 bytes; that's more zeros than you'd care to count. In 2015,
if you see, we were dealing with only 3.7 exabytes, and in just five
years we're going up to 30.6 exabytes. It may well be more in the
coming years, because data has been rising exponentially and we are dealing
with a lot of data now. Cisco has also mentioned three major reasons
behind the rise of data. The first one is adapting to smarter mobile
devices. Now, gone are the days when we were using
phones like the Nokia 1100, which could only make and receive calls
and send a few lines of text messages. Nowadays everyone is using
smartphones, we are using different apps on our phones, and each of those apps
is generating a lot of data. The next reason that they have mentioned is
advances in cellular networks. Earlier we had 2G, then came
3G and 4G, and we're looking forward to 5G. Over time we have been advancing in
cellular network technology too, and it has made it feasible for us to
communicate faster and better. And that's why, since I already told
you that we love to share things, it has become very easy for us to send a
message, a video, or any kind of file to a friend who is even
countries apart, and it takes only a few seconds, not even seconds, milliseconds,
for that person to receive that message. That is why we're using it
extensively, because of the ease of use that we are being provided. The
next reason that they have mentioned is reviewing tiered pricing. The network
companies are also providing you with a lot of data plans that your entire
family can use; now we have unlimited data plans and shared plans, which is
very convenient for us, and that's why we're using data so extensively. So there
are a lot of mobile users nowadays; the stats also say that we get 217 new
mobile users every 60 seconds, so you can imagine that out of the world
population almost everyone uses a mobile phone now. Well, almost.
So you can say that we are dealing with a lot of data, and that is why the name
"big data" comes up. So now let us see what big data is. As the name goes, big and
data, you already understood that it is a large amount of data that we are
dealing with. But if you ask me, I see it as a problem statement that surrounds
the incapability of a traditional system to process it. When the
traditional systems were created, we never thought that we'd have to deal
with such an amount of data, and such kinds of data, so
they are unable to process this amount of data that is being generated at
such high speed. That's why big data is a problem: traditional
systems are not able to store big data and process it. Now, since I told you
that big data is a problem, IBM has suggested 5 V's
in order to identify a big data problem and those are in front of your screen so
the first one is "volume". It implies that the amount of data the client is
dealing with is so huge that it becomes increasingly difficult for the client or
customer to store the data in traditional systems, and that is the time
we should approach for a solution. The next V we'll talk about is
“variety” now we already know that we are dealing with huge volume of data with
exabytes of data but these are coming from a variety of sources now we’re
dealing with mp3 files we’re dealing with video files images JSON now they
are of all different kinds so the mp3 files and video files they are all
unstructured data JSON files are semi structured and there are some structured
data as well but the major problem is that most of the data almost 90% of the
data is unstructured so should we just dump all those unstructured data or
should we make use of it? Obviously we should make use of it, because the
unstructured data that we are talking about, the photos and videos that we
mostly share on Facebook, is very important
data, because it is used by companies to make business decisions based on the
insights gained from it. This data gives companies an opportunity to profile
their customers: on Facebook you go around liking different pages, and
that is profiling, because now the company knows what kind of things
you like, and they can approach you through advertising. On Facebook you
can see that when you're browsing your news feed, on the right-hand side
certain ads pop up, and you'll find that those ads are
user specific; they know what kind of things you like, because you have browsed
through different pages on Facebook, on Google, or many other websites. That is
why this unstructured data, which makes up 90% of data, is very,
very important. And it is also a problem, because our
traditional systems are incapable of processing this unstructured data. The
next V that comes up is "velocity". Let's talk about a web service to
understand this. Say you create a web service and you provide the web
service for clients to access; how many events can the web service handle
at a point in time? You could say maybe a thousand or two thousand; generally
there will be almost two thousand live connections at any point in time on
average. Normally there is always a restriction on the number of live
connections available at a point in time. So suppose that your company has a
threshold of five hundred transactions at a time and that is your
upper limit. But today you cannot get away with that kind of number: in the big data
world you are talking about sensors, you are talking about machines that are continuously
sending you information, like a GPS continuously sending location information
to somebody; you're talking about millions and billions of event strikes
per second in real time. So you need some extended capabilities to withstand
the velocity at which data is getting dumped into your
systems. So if you think that velocity can be a challenge to your
customer, then you propose to them, again, a big data solution, because this is again
a big data problem. Now, the next V that we'll talk about is "value". Now, if your
data set cannot give you the necessary information which you can use to gain
insights and develop your business then it’s just garbage to you because it is
very important that you have the right data and that you can extract the right
information out of it. Now, there might be unnecessary data lying around
in your data set, data that is of no use to you, and you have to be able to
identify which data will give you the value that you need in order to
develop your business. So that is again a problem, to identify the
valuable data, and hence it is again a big data problem. And finally we'll talk
about "veracity". Veracity talks about the uncertainty
of data. In simple words, veracity says that you cannot expect the data to be
always correct or reliable. In today's world you might get data with
missing values; you may have to work with various types of data that are incorrect,
or data which may not always hold true. In other words, veracity means that
you have to build the system with an understanding that the data may
not always be correct and up to the standard. It is up to you as an
application developer to integrate the data, flush out the
data that does not make any sense, extract only the data that makes sense
to you, and use it for making decisions at the end. So these are the
five V's that will help you identify a big data problem, whether your data is
big data or not, and then you can approach a solution for it.
so this was an introduction to big data so now we’ll understand the problems of
big data and how you should approach for a solution for it with a story that you
can relate to so I hope that you’ll find this part very interesting. So this is a
very typical scenario. This is Bob, and he has opened up a very small restaurant
in a city. He hired a waiter for taking orders, and this is the chef
who cooks all those orders and finally delivers them to the customers. Now what
happens here is that this is the cook and he has access to a food shelf and
this is where he gets all the ingredients from in order to cook a
particular dish. Now, this is the traditional scenario: he's getting
two orders per hour and he's able to cook two dishes per hour, so it's a happy
situation for him. He's cooking happily, the customers are getting served
because there are only two orders per hour, and he has got all the time and
access to the food shelf; it's a happy day. Similarly, if we compare the
same scenario with your traditional processing system so data is also being
generated at a very steady rate and all the data that is being generated is also
structured which is very easy for our traditional system to process it so it’s
a happy day for the traditional processing system too. Now let us talk
about a different day so this is the other scenario so Bob decided to
take online orders, and now they are receiving many more orders than expected.
From two orders per hour, the orders have risen to ten orders per hour,
and now he has to cook ten dishes every hour. This is quite a bad situation
for the cook, because he is not capable of cooking ten dishes every hour when
before he was only doing two dishes every hour. So now consider the scenario
of our traditional processing system too: there is a huge amount and a huge
variety of data being generated at an alarming rate.
You have already seen the stats that I showed you of how much data is
being generated every 60 seconds, so the velocity is really high, and it is mostly
unstructured data, and our traditional processing system is not capable of
handling that. So it's a bad day for our processing system too. So now what should
be the solution for it so I would ask you guys so what should Bob do right now
in order to service customers without delay all right so I’m getting some
answers so Sebastian is saying that Bob should hire more cooks and exactly
Sebastian you are correct so the issue was that there were too many orders per
hour so the solution would be hire multiple cooks and that is exactly what
Bob did so he hired four more cooks and now he has five cooks and all the cooks
have access to the food shelf this is where they all get their ingredients
from so now there are multiple cooks cooking food even though there are ten
orders per hour maybe each cook is taking two orders every hour and they’re
serving people. But there are still issues, because there is only
one food shelf, and there might be situations where two of the cooks,
let's say these two cooks, want the same ingredient at the same time and they are
fighting over it, or the other cooks have to wait until one of the cooks has
taken all the ingredients he needs from the food shelf, and by that time maybe
another cook has got something on the stove and it has already burned while he was
waiting for the other cook to finish so that he could get his hands on the ingredient
that he wants. So again it is a problem. Now let us consider the same
situation with the traditional processing system. Now we have got
multiple processors in order to process all the data which was being problematic
so it should solve the problem, right? But again there is a problem, because all
these processing units are accessing data from a single point, which is the data
warehouse. Bringing data to the processing generates a lot of network overhead,
there would be a lot of input/output overhead, and there would be network
congestion because of that. Sometimes there might be situations where one
processing unit is downloading data from the data warehouse and the other units
have to wait in a queue to access that data, and this will completely fail
when you want to perform near real-time processing. That is why this solution will fail
too. So then, what should be the solution? Can
I get a few answers? Okay, so Geisha says that it should be
distributed and parallel. You are right, Geisha. So, since the food shelf is
becoming a bottleneck for Bob, the solution was to provide a distributed
and parallel approach, and we'll see how Bob did that. As a solution, what Bob
did is that he divided up an order into different tasks so now let us consider
the example of meat sauce let’s say that a customer has come into Bob’s
restaurant and he has ordered a meat sauce so what happens in Bob’s kitchen
now is that each of the chefs have got different tasks so let’s say in order to
prepare meat sauce this chef over here he only cooks meat and this chef over
here he only cooks sauce and he has also hired a head chef in order to combine
the meat and the sauce together and finally serve the customer. So these two cooks
cook the meat, and these two cooks prepare the sauce, and they are doing
this in parallel, at the same time, and finally the head chef merges the order
and the order is completed. Now if you remember, the food shelf was also a
bottleneck, so what Bob did to solve this is that he distributed the
food shelves in such a way that each chef has got access to his own shelf. So
this shelf over here holds all the ingredients that this chef might need,
and similarly he has got three more shelves that has got the same
ingredients now again let’s say that we have a problem that one of the cooks
falls sick so in that case we don’t have to worry much since we have got another
cook who can also cook meat so we can tackle this problem very easily and
similarly let’s say there comes another problem where a food shelf breaks down
and this cook over here has no access to ingredients so again we don’t have to
worry since there are three more shelves and at that time of disaster we have a
backup of three more shelves so he can go ahead and use ingredients from any of
the shelves over here. So basically we have distributed and parallelized the
whole process, broken it into smaller tasks, and now there is no problem; Bob's
restaurant is able to serve customers happily. Now let me relate the situation with
Hadoop. Consider what I've
told you, that each of the chefs has got his own food shelf: in Hadoop terms this
is known as data locality. It means that data is locally available to the
processing units. And this whole thing where all the different tasks of cooking
meat and sauce are happening in parallel, this is known as map in Hadoop terms, and
when they're finally merged together and we have got meat sauce as a
dish from the head chef, this is known as reduce. We'll be learning MapReduce
in Hadoop later in this tutorial, so don't get confused with the terms; if I'm
saying them right now and you're not able to understand, you'll be clear by
the end of this tutorial, I promise you that. So now he is able to handle all the
ten online orders per hour, and at times, let's say on Christmas or New Year,
even if Bob is getting more than ten orders per hour, the
system that he has developed is scalable: he can hire more chefs, or more
head chefs, in order to serve more orders per hour. This is a scalable
system, so he can scale up and scale down whenever he needs: he can hire more
chefs or let chefs go whenever he needs. So this is the ultimate solution
that Bob had, and it is very effective indeed. But now, Bob has solved all his
problems; have we solved ours? Do we have a framework
that could solve all the big data problems of storing and processing data?
Well, the answer is yes: we have something called Apache Hadoop, and this is the
framework to process big data. So let us go ahead and see Apache Hadoop in detail.
So Hadoop is a framework that allows us to store and process large data sets in
a parallel and distributed fashion. Now, you know that there are two major problems
in dealing with big data, and the first one is storage. To solve the
storage problem of big data, we have HDFS. Just as Bob solved the
food shelf problem by distributing shelves among the chefs,
Hadoop solves the storing of big data with HDFS, which stands for
Hadoop Distributed File System. So now, all the big amount of data that we are
dumping is distributed over different machines,
and these interconnected machines, across which our data is distributed,
are what we call in Hadoop terms a Hadoop cluster. And again, just as Bob
managed to divide the tasks among his chefs and made the serving process
much quicker, in order to process big data we have something
called MapReduce, and this is the programming unit of Hadoop. It
allows parallel and distributed processing of the data that is lying across
our Hadoop cluster. Every machine in the Hadoop cluster processes the data
that it has got, and this is known as map. Finally, the intermediate outputs
are combined to provide the final output, and this is called reduce,
and hence the name MapReduce.
So now let us understand the Hadoop architecture, which is a master/slave
architecture, and we'll understand it by taking a very simple
scenario which I'm very sure you'll all relate to very closely. This is a
scenario which is usually found in every other company so we have a project
manager here and this project manager handles a team of four people so the
four people here in our example are John, James, Bob and Alice.
Whatever project he gets from a client, he distributes it across his
team members and keeps track of how the work is going from time to time.
So now let us consider that the project manager here has received four
projects from a client; let's say the projects are A, B, C and D, and he has
assigned all these projects across the team. So John has got project A, James
has got project B, Bob has got C, and Alice has got D. So everyone is handling
and working on a different project and the work is going on fine so he’s quite
sure that he’ll be able to meet the deadlines and deliver the project in
time. But there is a problem: Bob has applied for leave, and he tells the project
manager, "I'm going on leave for a week or two, I won't be coming to the
office and I can't do the work." Now this is a problem for the project manager,
because at the end he is liable to the client for any work that has not been
completed, so he has to make sure that all the projects are delivered
on time. So he thinks of a plan, because he's a very clever fellow. In order to
tackle this problem, the project manager goes to John and
tells him, "Hey John, how are you doing?" and John says, "Yeah, I'm doing great."
"Yeah, I heard that you're doing really great, you're doing excellent on your
project." And John thinks, something's fishy, why is he appreciating me
so much today? Then the project manager goes ahead and
tells him, "So John, since you're doing so well, why don't you take up project C
as well?" And John thinks, okay, that's it, and replies back to the
manager, "No, I'm fine with the project that I've got, I have a lot of work to do
already, I don't think I can take project C." Then the project manager says, "No, no,
you've got me wrong, you don't have to work on project C; you know that Bob
is already working on project C. You can keep it as your backup project, and
you never know, you might not even have to work on project C
at all, but you'll get the credit for both projects at the end, and I could refer
you for a substantial hike." Then John thinks it's quite a good deal: he might
not even have to work on it and he'll get a hike for it. So he agrees
and takes up project C. Now the project manager has done his job; he
doesn't have to worry about completing project C even if Bob is going out of
town. And since he is a very, very clever fellow, in order to tackle future
problems too, he goes to each of the members and tells them the same
thing, and hence now he has got a backup for every project. So if any
of the members ever opted out of the team, he has got a backup, and this is
how the project manager completes all his tasks in the given time and the client
is satisfied. He also makes sure that he keeps his list updated, in
order to know who is carrying which backup project. And this is exactly
what happens in Hadoop: we have got a master node that supervises the
different slave nodes. The master node keeps a track record of all the
processing that is going on in the slave nodes, and in case of disaster, if any of
them goes down, the master node has always got a backup. Now if we
compare this whole office situation to our Hadoop cluster, this is what it looks
like: this is the master node, the project manager in the case of our
office, and these are the processing units where the work is getting carried
out. So this is exactly how Hadoop processes and manages big data
using the master/slave architecture, and we'll understand more about the master node
and the slave nodes in detail later in this tutorial. Any doubts till now? All right, so now we'll move ahead to the Hadoop core components.
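The project-manager story can be sketched as a toy simulation in plain Python. This is purely illustrative and is not how Hadoop actually implements replication; the names and the dictionary structure here are invented just to mirror the analogy of a master that tracks a primary and a backup for every piece of work.

```python
# Toy model of the master/slave backup idea (illustrative, not real Hadoop code).
# The "master" records, for each project, a primary worker and a backup worker.
assignments = {
    "A": {"primary": "John",  "backup": "Alice"},
    "B": {"primary": "James", "backup": "John"},
    "C": {"primary": "Bob",   "backup": "John"},
    "D": {"primary": "Alice", "backup": "James"},
}

def handle_failure(assignments, failed_worker):
    """When a worker goes down, promote the backup on each of its projects."""
    for project, who in assignments.items():
        if who["primary"] == failed_worker:
            who["primary"] = who["backup"]
            who["backup"] = None  # a real master would then pick a fresh backup
    return assignments

# Bob goes on leave: project C falls to its backup, John.
handle_failure(assignments, "Bob")
print(assignments["C"]["primary"])  # John
```

The point of the sketch is only that the master never loses track of the work: every task has a standby owner before any failure happens, which is exactly what the project manager arranged.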
We're going to take a
look at HDFS first, which is the distributed file system in Hadoop. At
first, let's take a look at the two components of HDFS. Since we're already
talking about master and slave nodes, let us take a look at what the name node is
and data node. These are the components you'll find in HDFS. Since
we're already talking about a master/slave architecture, the master
node is known as the name node and the slave nodes are known as data nodes. The name
node maintains and manages all the different data nodes,
which are slave nodes, just like our project manager manages a team. And like
how you guys report to your manager about your work progress,
the data nodes do the same thing, by sending signals known as
heartbeats. A heartbeat is just a signal to tell the name node that the data node is
alive and working fine. Now coming to the data node: this is where your actual
data is stored so remember when we talked about storing data in a
distributed fashion across different machines so this is exactly where your
data is distributed across and it is stored in data blocks so the data node
over here is responsible for managing your data across data blocks, and
the data nodes are the slave daemons, while the master daemon is the
name node. But here you can see another component, which is the
secondary name node, and by the name you might be guessing that this is just a
backup for the name node, to take over when the name node crashes.
But actually that is not the purpose of the secondary name node; its
purpose is entirely different, and I'll tell you what it is. You just have to
be patient for a while, and I'm very sure that you'll be intrigued to know
how important the secondary name node is. So now let me tell you about the
secondary name node. We are talking about metadata, which is nothing
but information about our data: it records all the modifications that have
taken place across the Hadoop cluster, or our HDFS namespace, and this metadata is
maintained by HDFS using two files. Those two files are the FsImage and the edit log,
and let me tell you what those are. The FsImage, this file over here,
contains all the modifications that have been made across your Hadoop cluster
ever since the name node was started. Let's say the name node was started 20
days back: my FsImage will contain the details of all the changes that
happened in those 20 days. So obviously you can imagine
that there will be a lot of data contained in this file, and
that is why we store the FsImage on disk; you'll find the FsImage
file on the local disk of your name node machine. Now coming to the edit log: this
file also contains metadata, that is, data about your modifications, but it
only contains the most recent changes, let's say whatever modifications
took place in the past one hour. This file is small, and it resides in
the RAM of your name node machine. So we have the secondary name node, which
performs a task known as checkpointing. Now, what is checkpointing? It is the
process of combining the edit log with the FsImage. And how is it done? The
secondary name node gets a copy of the edit log and the FsImage
from the name node, and then it combines them in order to get a new FsImage. So why
do we need a new FsImage? We need an updated FsImage in order to
incorporate all the recent changes into the FsImage file. And why do we
need to do this regularly? Well, if you keep all the
modifications in your edit log, remember that your edit log resides in your RAM,
so you cannot let your edit log file grow too big. As time passes
you'll be making more modifications and more changes, and these get stored in
the edit log first, so the file gets bigger; it might end up
taking a lot of space in your RAM and make the
name node quite slow. Also, think about the time of failure: let's say that your
name node has failed and you want to set up a new name node. You've got all the
files that are needed in order to set up a new name node, because the most
recently updated copy of the FsImage, with all the metadata you need about the
data nodes that your name node was managing, will be found in your
secondary name node. That's why your failure recovery time becomes much
shorter, and you'll not lose much data or much time in setting up a
new name node. By default, checkpointing happens every hour, and
while a checkpoint is happening you might be making some more
changes; those changes are stored in a new edit log,
and until the next checkpoint happens we'll be maintaining that new edit log file,
which will again contain all the recent changes since the last checkpoint. So
this is the edit log that, when we perform the next checkpoint, we'll
take in, with all its modifications, and combine
with the last FsImage that we had. This checkpointing keeps going
on, and by default it takes place every one hour; if you want checkpointing
to happen at shorter intervals you can do that, and if you want it after a longer
time you can configure that too. So we have studied the HDFS components.
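To make checkpointing concrete, here is a toy model in plain Python. It is illustrative only: the real FsImage and edit log are binary files managed by the name node, and the paths and operations below are made up. The FsImage is modeled as a dict of the namespace, the edit log as a list of recent operations, and a checkpoint replays the log into the image and starts a fresh log.

```python
# Toy model of HDFS checkpointing (illustrative; not HDFS's real file format).
fsimage = {"/data/file1": 128, "/data/file2": 256}   # namespace snapshot, on disk
edit_log = [                                         # recent changes, kept in RAM
    ("create", "/data/file3", 64),
    ("delete", "/data/file1", None),
]

def checkpoint(fsimage, edit_log):
    """Replay the edit log into the fsimage, producing an updated snapshot."""
    for op, path, size in edit_log:
        if op == "create":
            fsimage[path] = size
        elif op == "delete":
            fsimage.pop(path, None)
    edit_log.clear()  # a fresh edit log collects changes until the next checkpoint
    return fsimage

checkpoint(fsimage, edit_log)
print(fsimage)  # {'/data/file2': 256, '/data/file3': 64}
```

This mirrors the idea in the transcript: the big, on-disk snapshot absorbs the small, in-RAM log at regular intervals, so the log never grows unbounded and a new name node can be rebuilt from a recent snapshot.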
We have taken a look at what the name node is and how it manages all the data
nodes, and we have also seen the functions of the secondary name node. So now let us
see how the data is actually stored in the data nodes. HDFS is a block-structured
file system: each file is divided into blocks of a particular size,
and by default that size is 128 MB. So let us understand how HDFS stores
files and data blocks with an example. Suppose a client wants to store a file
of 380 MB in the Hadoop Distributed File System.
Now, what HDFS will do is divide the file into three
blocks, because 380 MB divided by 128 MB, the default size of each data
block, is approximately three. Here the first block will occupy 128 MB, the
second block will also occupy 128 MB, and the third block will be the remaining
size of the file, which is 124 MB. After my file has been divided into data
blocks, these data blocks will be distributed across the data nodes
present in my Hadoop cluster. Here you can see that the first part of my
file, which is 128 MB, is in data node 1, the next data block is in data node 2,
and the final data block is in data node 3. If you notice, the sizes of all the
blocks are the same except for the last one, the 124 MB data block, and this
helps Hadoop save HDFS space, as the final block uses only as much
space as is needed to store the last part of the file.
So we have saved 4 MB from being wasted in this scenario. Now, it may
seem very little to you that we have only saved 4 MB, so what's the big deal?
But imagine you are working with tens of thousands of such files; think how much space
you can save. So this was all about data blocks and how HDFS stores them
across different data nodes, and I suppose by now you have understood why we need a distributed file system.
blocks across different data nodes and I suppose that by now you have understood
that why do we need a distributed file system so let me tell you that we have
got three advantages when we are using a distributed file system so let me
explain this to you with an example so now I imagine that I have got a Hadoop
cluster with four machines so one of them is the name node and the other
three are data nodes so where the capacity of each of the data node is one
terabyte so now let’s suppose that I have to store a file of three terabytes
so since all my data nodes have a capacity of one terabyte this will be
distributed the file of three terabyte will be distributed across my three data
nodes and one terabyte will be occupied in each data node so now I don’t have to
worry about how it is getting stored so HDFS will manage that and if you see
that this provides me with an abstraction of a single computer that is
having a capacity of three terabytes so that’s the power of HDFS and let me
explain you the second benefit of using a distributed file system so now
consider that instead of three terabytes I have to store a file of four terabytes
and my cluster capacity is only of three terabytes so I’ll add one more data node
in my cluster in order to fit my requirements and maybe later on when you
need to store a file of huge size you can go ahead and add as many machines in
your cluster in order to fit all your requirements to store the file so you
can see that this kind of file system which is distributed is highly scalable
Now let me tell you the third benefit of using a distributed file system. Let's consider that you have a single high-end computer which has the power of processing one terabyte of data in four seconds. Now, when you distribute your file across four machines of the same capacity and the same processing power, you are reading that file in parallel. So instead of one, if you have got four data nodes in your cluster, it will take one fourth of the time that you were taking with a single computer, so it will take you only one second. So basically, with the help of a distributed file system, we are able to distribute our large file across different machines, and we're also reducing the processing time by processing it in parallel, and because of this we're able to save a huge amount of time in processing the data. So these are the benefits of using HDFS.
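That speed-up arithmetic can be written down as a toy calculation (my own sketch, not Hadoop code), where each data node scans its own share of the file at the same time:

```java
// A toy calculation of the speed-up from reading a file in parallel:
// each data node scans its own blocks simultaneously, so wall-clock
// time shrinks roughly by the number of nodes.
public class ParallelReadTime {
    // secondsPerTb: how long one machine needs to process 1 TB
    static double processingTime(double fileTb, int dataNodes, double secondsPerTb) {
        double perNode = fileTb / dataNodes; // each node holds an equal share
        return perNode * secondsPerTb;       // all shares are scanned in parallel
    }

    public static void main(String[] args) {
        System.out.println(processingTime(1.0, 1, 4.0)); // single machine → 4.0 s
        System.out.println(processingTime(1.0, 4, 4.0)); // four data nodes → 1.0 s
    }
}
```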
So now let us see how Hadoop copes with data node failure. Now, you know that we are storing our data in data nodes, but what if a data node fails? Let us consider the same example over here: you know that I have got a file of 380 MB, and I have got three data blocks which are distributed across three data nodes in my Hadoop cluster. So let's say the data node which contains the last part of the file crashes. Now you have lost a part of your file, and you can hardly process that file, because you don't have a part of it. So what do you think could be a solution for that? So I'm getting an answer: Sebastian says that we should have a backup. Yes, exactly. The logical approach to solve this problem would be to have multiple copies of the data, right? And that is how this was solved in Hadoop, by introducing something which is known as the replication factor. You all know what a replica is: a replica is nothing but a copy, and similarly all our data blocks will also have different copies, and in HDFS each of the data blocks has got three copies across the cluster. So you can see that this part of the file, which is the 124 MB data block, is present in data node 2, data node 3 and data node 4, and similarly this is common to the other data blocks as well. So every data block will be there in my Hadoop cluster three times. Even if one of my data nodes crashes and I lose all of the data blocks that were inside that data node, I don't have to worry, because there are two more copies present in the other data nodes. And we maintain three copies because in Hadoop we are dealing with commodity hardware, and it is very likely that our commodity hardware will crash at some point of time. So even if two of the copies go down, we still have got one more. So this is how HDFS performs fault tolerance. And I have got a question from Neha:
so she is asking whether we have to go ahead and make the replicas of our data blocks ourselves. Well no, you don't have to do that. Whenever you put or copy any kind of file into your Hadoop cluster, your files will get replicated by default, and by default it will have a replication factor of three. It means that every data block will be present automatically three times across your Hadoop cluster. So Neha, I hope that you've got your answer. Okay, she is saying yes. Thank you, Neha, for the question; that was a very good question indeed. So we don't have to worry now: if a data node crashes, we have got multiple copies. And you know the proverb, never put all your eggs in the same basket; this is very true in the scenario that we are dealing with right now. We are not putting all our eggs in the same basket, we're putting our eggs in three different baskets. So even if one basket falls and all its eggs crack open, we don't have to worry, we have enough eggs for our omelet. I hope that you all have understood how HDFS provides fault tolerance; if you have any questions you can go ahead and ask me, or you can save your questions for the end of this session.
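By the way, the defaults we have been talking about, the 128 MB block size, the replication factor of three, and the one-hour checkpoint interval, are all configurable. As a rough sketch (the property names are from Hadoop 2.x; the values shown are the defaults), an hdfs-site.xml could look like this:

```xml
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- every block is kept three times -->
  </property>
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value> <!-- checkpoint every hour, in seconds -->
  </property>
</configuration>
```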
Let us understand what happens behind the scenes when you are writing a file into HDFS. When you want to write a file across your Hadoop cluster you have to go through three steps, and the first step is the pipeline setup. So let us understand how the pipeline is set up with an example. Let's say that I have got a text file, maybe it's called example.txt, and it is divided into two data blocks, block A and block B. Let us talk in terms of block A first, and see how block A is written across the data nodes in my HDFS. So here is the client. The client at first requests the name node, telling it: I have got a block that I need to copy. So the name node says: okay, I'll give you the IP addresses of three data nodes, and you can copy your block to these three data nodes; you know that you have to copy your block three times, because the replication factor is three. The name node here gives the IP addresses of three data nodes, data nodes 1, 4 and 6, to the client. So now the client has got the IP addresses of the three data nodes where block A will be copied. At first the client goes to data node 1 and asks: hey, I want to copy a block onto your data node, so are you ready, and can you go and ask data nodes 4 and 6 if they're ready too? And data node 1 says: yeah, I'm ready, and I'll go ahead and ask 4 and 6. So now data node 1 goes to data node 4 and says: hey, the client is asking to copy a block, are you ready? Then 4 says: yeah, I'm ready, and it is asked to go ahead and check whether 6 is ready as well. So 4 asks 6, and 6 is also ready, and this is how the whole pipeline is set up: first block A will be copied to data node 1, then data node 4, and then data node 6. Now let's say that in some situation there are no data nodes available, that whatever IP addresses the name node gave, maybe those data nodes are not functioning or not working. In that case, when the client doesn't receive any confirmation, it goes back to the name node and says: hey, whatever IP addresses you've given me, those data nodes are not working, so could you give me other ones? And then the name node checks which data nodes are available at that time and gives their IP addresses to the client again. So now your pipeline is ready: first the block will be copied onto data node 1, then data node 4, and then data node 6. This is your pipeline. So now comes the second step,
where the actual writing takes place. Now, since all the data nodes are ready to copy the block, the client will contact data node 1 first, and data node 1 will copy block A. The client then gives data node 1 the responsibility of copying the block onto the rest of the pipeline, that is, data node 4 and data node 6. So now data node 1 will contact data node 4 and tell it: copy block A onto yourself, and ask data node 6 to do the same. Data node 4 will then copy block A and pass the message on to data node 6, and similarly data node 6 will also copy the block. So now you have got three copies of the block, just as we require, and this is how the writing takes place. After that, the next step is a series of acknowledgments. So now we have a pipeline and we have written our block onto the data nodes that we wanted. The acknowledgments take place in the reverse order of the writing: at first data node 6 will give an acknowledgment to data node 4 that it has copied block A successfully; then data node 4 will receive that acknowledgment and pass it on to data node 1, saying: I have copied block A onto myself, and so has data node 6. So all these acknowledgments will be passed to data node 1, and data node 1 will finally give an acknowledgment to the client that all three copies of the block have been written successfully. After that the client will send a message to the name node that the writing has been successful, that the block has been copied to data nodes 1, 4 and 6. The name node will receive that message and update its metadata, recording which blocks are copied on which data nodes. So this is how the write mechanism takes place: first the pipeline setup, then the actual writing, and then you get the acknowledgments.
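To recap the three steps in one place, here is a toy plain-Java walk-through of writing one block (my own sketch, not real HDFS client code; the node IDs 1, 4 and 6 are just the ones from our example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A toy walk-through of the three write steps for one block: set up the
// pipeline, copy node-to-node down the pipeline, then acknowledge back
// up in reverse order.
public class WritePipeline {
    static List<String> writeBlock(String block, List<Integer> pipeline) {
        List<String> log = new ArrayList<>();
        // Step 1: pipeline setup (the name node has already supplied the node IDs)
        log.add("pipeline: " + pipeline);
        // Step 2: sequential copy, each node forwarding to the next
        for (int node : pipeline)
            log.add("copy " + block + " -> data node " + node);
        // Step 3: acknowledgments travel in reverse
        List<Integer> reversed = new ArrayList<>(pipeline);
        Collections.reverse(reversed);
        for (int node : reversed)
            log.add("ack from data node " + node);
        log.add("client reports success to name node");
        return log;
    }

    public static void main(String[] args) {
        writeBlock("A", List.of(1, 4, 6)).forEach(System.out::println);
    }
}
```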
Now, we just talked about a single block. As I told you, my file, the example.txt file, was divided into two blocks, block A and block B. The write mechanism for block B will be similar; only, when the client requests to copy block B it might get the IP addresses of different data nodes. For example, block B is copied to data nodes 3, 7 and 9, while block A was copied to 1, 4 and 6. Now let me tell you that the writing of block A and block B will happen at the same time. Obviously, as I told you, the writing mechanism takes place in three steps, so the actual writing of one block happens sequentially, meaning it will first get copied to the first data node, then the second, and then the third; but the different blocks will be copied at the same time. So the writing of block A and block B takes place at the same time: step 1a and step 1b take place at the same time, and steps 2a and 2b take place at the same time. When the client is copying the different blocks onto different data nodes, 1a and 1b happen together, and while block A is getting copied onto data node 1, block B is getting copied onto data node 7, and similarly the other steps are also taking place at the same time. However many blocks your file contains, all the blocks will be copied at the same time, each in sequential steps, onto your data nodes. So this is how the writing mechanism takes place. So now let us see
what the story is behind reading a file from the different data nodes in your HDFS. Let me tell you that reading is much simpler than writing a block to your HDFS. So let's say now my client wants to read that same file that has been copied across different data nodes in my HDFS; you know that block A was copied onto data nodes 1, 4 and 6, and block B was copied onto data nodes 3, 7 and 9. Now my client will again request the name node, saying: I want to read this particular file, and my name node will give the IP addresses where all the data blocks of that particular file are located. The client will receive those IP addresses and contact the data nodes, and then all the data blocks will be fetched; my data block A and my data block B will be fetched simultaneously, and then the file will be read by the client. So this is how the entire read mechanism takes place.
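The read path can be sketched in the same toy style as the write path (again my own illustration, not HDFS client code; the block-to-node mapping is the one from our example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A toy sketch of the read path: ask the name node where each block of the
// file lives, then fetch the blocks and hand them to the client in order.
public class ReadFile {
    // the name node's metadata: block -> the data nodes holding its replicas
    static Map<String, List<Integer>> metadata = Map.of(
            "A", List.of(1, 4, 6),
            "B", List.of(3, 7, 9));

    static List<String> read(List<String> blocks) {
        List<String> fetched = new ArrayList<>();
        for (String block : blocks) {
            List<Integer> nodes = metadata.get(block);              // addresses from the name node
            fetched.add(block + " from data node " + nodes.get(0)); // any replica will do
        }
        return fetched;
    }

    public static void main(String[] args) {
        // in HDFS the blocks are fetched simultaneously; here we just loop
        System.out.println(read(List.of("A", "B")));
    }
}
```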
So guys, this is all about HDFS. We have seen how a file is copied across a Hadoop cluster in a distributed fashion, then we have seen the advantages of using a distributed file system, we have also understood what the name node is and what data nodes are, and how your files are stored, divided up into data blocks and spread across your Hadoop cluster. We have also seen how Hadoop deals with a data node failure and how it introduced the replication factor as a backup for your file, and then we have seen how the read and write mechanisms take place. So I hope that you have all understood what the Hadoop distributed file system is; if you have any questions you can ask me. And now let us
go ahead and move on, and let us check what MapReduce is. Now, you remember the example that we gave at the start of our session, the cook example: how different chefs cook different dishes at the same time and finally a head chef assembles the dishes all together and gives the desired output. This is what we'll be learning now, and we'll be learning it with more relevant examples so that you can understand MapReduce better. So let us understand MapReduce with another story, which you'll find amusing again, I'm very sure about that. Let us consider a situation where we have a professor and there are four students in the class, and they are reading the book Julius Caesar. Now the professor wants to know how many times the word Julius occurs in the book, so for that he asks his students: go ahead, read the entire book, and tell me how many times the word Julius is there in the book. All of the students have got a copy of the book and they start counting the word Julius, and it took them four hours to do so. So the first student answers that he's got 45; the second one answers 46, maybe he made a calculation mistake or maybe he is correct, we don't know, because we don't have the book with us; the third student also replies 45, and the fourth also replies 45. Then the professor decides: okay, three people can't be wrong, I have to go with the majority, and the majority is usually correct. So he goes with the answer that the word Julius appeared 45 times in the entire book, and it took a time of four hours. Then the professor thought that this was taking a lot of time, so this time the professor applied a different method. Let us assume that the book has got four chapters, and he distributed one chapter to each of the students. He asked student one: you go through chapter one and tell me how many times Julius occurs in chapter one, and similarly he assigned chapter 2 to the second student, chapter 3 to the third, and chapter 4 to the fourth. So now, since they are each assigned only one chapter instead of the entire book, they're able to count the word Julius in an entire chapter in just one hour, and they're doing it at the same time. So at the same time chapter 1 has been counted, chapter 2 has been counted, chapter 3 has been counted, and chapter 4 has also been counted, and everyone gave their respective answer. So one student went up to the professor and said: I found the word Julius 12 times in chapter 1; the second student said: I found it 14 times in chapter 2; for chapter 3 the third says: I found it 8 times; and for chapter 4 the fourth says: I found it 11 times. So the professor received all the different answers from all four students, and finally he adds them up in order to get the answer of 45. And let's assume that it took him two minutes to add them up; these are very small numbers so it might not take two minutes, but we are just assuming it. So instead of four hours, now we are able to find out the correct answer in just one hour and two minutes, and this is a very effective solution. So the part where the chapters were distributed and each of the students was working on a part of the book, this part is known as map, and finally, when the professor is summing up all the numbers together, this part is known as reduce, and this entire thing is MapReduce in the concepts of Hadoop. So the processing of a single file is divided into parts which get processed simultaneously, and finally the reducer adds all the intermediate results and gives you the final output; this is a very effective solution because all the tasks are happening in parallel and in less time. So I hope that you have understood the essence of MapReduce with this example. So now
let us go ahead and understand MapReduce in detail. MapReduce is the programming part of Hadoop: it is a framework that gives us the advantage of processing large data sets in a distributed way. MapReduce consists of two distinct tasks: the first one is known as map and the second task is known as reduce, and as the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed, because the reducer needs the intermediate results that are produced by map in order to combine them and finally give you the final output. So the first is the map job, where a block of data is read and processed to produce key-value pairs as an intermediate output; then the output of a mapper or a map job, which is nothing but key-value pairs, is input into the reducer. The reducer receives the key-value pairs from multiple map jobs, and then it aggregates all the intermediate results and finally gives you the final output in the form of key-value pairs. So this is how MapReduce takes place; we'll be understanding it in detail now, and I hope that you have all understood this. So let us move on right now and understand MapReduce with an example,
which is a word count program. So let us say that we have got a paragraph, we have got this much text: deer beer river, car car river, deer car beer, and we want to find out how many times each word appears in this particular paragraph. This is how MapReduce works. Since you know that we divide up the entire task into different parts, here we'll divide the text into three splits, because there are three sentences: the first sentence is deer beer river, the second one is car car river, and the third is deer car beer. Now the mapping will take place on each of the sentences over here, and since I already told you that a map job is something where data is read and then a key-value pair is formed, we have got the key, which is each of these words, and then a value is assigned, which is nothing but 1. So here the mapping takes place: each word is converted into a key-value pair with the word and the number one, and it happens similarly in the other two sentences as well. So first we divide up the input into three splits, as you can see in the figure over here, the three sentences that we have in our paragraph, and then we distribute this work among all the map nodes. After that, in mapping, we tokenize the words in each of the mappers and give each a hard-coded value 1. The reason behind giving the hard-coded value 1 is that every word in itself will occur once. So now a list of key-value pairs will be created, where the key is nothing but the individual word and the value is 1. After the mapper, sorting and shuffling happen, so that all the same keys are sent to the corresponding reducer. After the sorting and shuffling, each of the reducers will have a unique key and a list of values corresponding to that very key: so we have got beer two times, so we have got the key beer and its value, two 1s. Now what the reducer will do is count the values present in that list of values, so here one and one is two; and car was found three times, so three 1 values, so car will be 3; similarly deer 2 and river 2. Finally we'll get all the output together in key-value pairs: the reducer has combined all the different intermediate results together here, and we have got another set of key-value pairs which gives you the final output, where we can see that beer was found in our input two times, car three times, deer two times, and river two times. So this is how MapReduce occurs in Hadoop.
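The whole flow we just walked through can be sketched in plain Java, without any Hadoop at all (my own toy illustration; in real MapReduce the splits are processed on different machines in parallel):

```java
import java.util.*;

// A plain-Java sketch of the word count flow: map each line to (word, 1)
// pairs, shuffle the pairs by key, then reduce by summing the 1s.
public class WordCountFlow {
    // map: tokenize one line and emit a (word, 1) pair per token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+"))
            pairs.add(Map.entry(token, 1));
        return pairs;
    }

    // shuffle + reduce: group the pairs by key and sum the 1s for each word
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like the shuffle phase
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] splits = {"deer beer river", "car car river", "deer car beer"};
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits)
            intermediate.addAll(map(split)); // in Hadoop the splits run in parallel
        System.out.println(reduce(intermediate)); // → {beer=2, car=3, deer=2, river=2}
    }
}
```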
So I hope that you have understood this word count program; we'll also go ahead and run it. But first I'll tell you the major parts of a MapReduce program. First you have to write the mapper code, that is, how the mapping will happen, how all the distributed tasks will be carried out at the same time and how they will produce key-value pairs. Then comes the reducer code, which means how all the intermediate results, the key-value pairs that we have got from each of the mapping functions, will be merged. And then finally there is the driver code, where you specify all the job configurations, like what the job name is, the input and output paths, and so on. So these are the three parts of a MapReduce program in Hadoop. Now let's talk about the mapper code. Basically this
MapReduce in Hadoop so now let’s talk about the mapper code so basically this
is a Java program so for those of you who know Java and have been working on
Java this is a very simple program for you all but say let me go through and
explain the logic of this entire program so this is our mapper code and we have a
class here called map which extends to the class mapper and we have mentioned
the data types of our input/output key value pair with respect to mapper now
let me tell you that the mapper accepts input as the key value pair and gives
output also in a key value pair form so since we have this as an input which is
nothing but paragraphs and we have not specified any particular key or value to
it so the mapper here itself specifies the key as the byte offset type and the
value here would be each sentence or each tuple from the entire paragraph
that we are inputting into so the datatype of each of the key which
is nothing but the byte offset type will be wrong writable since it’s just a
number and how it takes the byte offset type let me tell you if you see the
input over here which is just a double if you see that in this sentence we have
got three words with four four and five character each and two blank spaces and
since they are all of character types and each character occupies eight bytes
of memory so if you add them up together you get 121 and this is the next byte
offset for the next couple so this is the data type of our byte offset type
which is wrong writable and the input type would be each of the sentence which
is nothing but text and if you remember the mapper produces an output again as a
key value pair so which will have nothing but each token which are also
nothing but each of the unique words in our particular tuple which is nothing
but text and then with a tokenized value a hard-coded value like we have done in
a previous example like we have assigned a hard-coded value 1 to each of the
token which is nothing but an integer so this the data type of our mapper value
output would be intractable so for this method we have got our key as divided
offset and the value as our tuples so we have got three tuples there and this
will be performed on each of the tuples in our input so the map method here
takes the key value and context as arguments so we have the byte offset as
our key and we have the tupple as our value and the context will allow us to
write our map output so what we are doing here is that we are storing each
of the tupple in a variable called line and then we’re tokenizing it means we
are just breaking up our each tuple into tokens which are nothing but each
individual words present in that tab and then we are assigning a hard-coded value
1 so each token will be our map output key along with the hard-coded value 1
and we have provided one as a hardcore value just because each of the word will
be at least occurring once in that particular tuple so the output keep
their values that will have will have something like each of the token
and then with a hard-coded value what if you remember the example which we just
learned a while ago so the output for the first couple in our example would be
D r1 b r1 and river one so this is the entire map record so now let us take a
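If the byte-offset idea feels abstract, here is a small plain-Java sketch (my own illustration, not part of the mapper code) that computes the offset key each line of our input would get, counting one byte per character plus the newline:

```java
import java.nio.charset.StandardCharsets;

// Each line's LongWritable key is the number of bytes that precede it
// in the file, counting the newline characters too.
public class ByteOffsets {
    static long[] offsets(String[] lines) {
        long[] keys = new long[lines.length];
        long offset = 0;
        for (int i = 0; i < lines.length; i++) {
            keys[i] = offset; // key handed to the mapper for this tuple
            offset += lines[i].getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] tuples = {"deer beer river", "car car river", "deer car beer"};
        for (long k : offsets(tuples)) System.out.println(k); // → 0, 16, 30
    }
}
```

"deer beer river" is 15 characters, so with the newline the second tuple starts at byte offset 16.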
Now let us take a look at the reducer code. Here too we have got a class, called Reduce, which extends the class Reducer, and you remember that the reduce takes place only after the shuffling and sorting. So here the input will be nothing but the output of our shuffling and sorting, and that output is something like this: a word along with the list of its occurrences after the mapping is done. So this will be our input, and the key here is nothing but Text, and the value here is nothing but a list of numbers, which is of the data type IntWritable. And finally it produces an output with the word and how many times it has occurred, which is again nothing but a word and a number, of the data types Text and IntWritable, something like what you can see over here. So what are we doing? We have got a method called reduce, and here we have got the input key, which is nothing but Text, and the input value as a list of values. Now, since it is a list, we'll just run a loop and we'll sum up the number of 1s for each of the tokens. So here, for beer, we have got two 1s, so we'll just sum up these two 1s and finally get the result. The output key will be Text, that is, a particular unique word, and the value will be the sum of all the 1s that were in that particular list. So here we have got 1 plus 1 as 2, so the final output would be beer 2, and similarly for car the input is 1, 1, 1, so we are getting car 3. So this is the whole reducer code. Now remember, I've told you that there
was one more section of code in the entire MapReduce program, and that third part is the driver code. This code over here will contain all the configuration details of our MapReduce job: for example, it will contain the name of my job and the data types of the input and output of the mapper and reducer. So you can see that my job name is my word count program, and here I have mentioned the name of my class, then the mapper class, which is known as Map, the reducer class, which is Reduce, and the output key class, which is Text. We can also set the output value class, and since in this example we are dealing with the frequency of words, which is nothing but numbers, we have mentioned IntWritable. Then again, if you want, you can set the input format class; this just specifies how a mapper will process a particular input, that is, what will be the unit of work for each map, and in our case the whole input text will be processed line by line, so we can specify that as well. Similarly we can also specify the output format class, that is, how the output will be written to our file, which is also line by line. And we can also go ahead and set the input path, mentioning the directory from which it will fetch our input file, and we can also go ahead and mention the output path, the directory where my output will be written to. So this is what exactly a driver code contains: nothing but the configuration details of your entire MapReduce job. So I hope that you have all understood this
program, and we'll just go ahead and execute it. So this is my VM where I have set up my HDFS; let's go ahead and execute the MapReduce program practically. Let me open my IDE first, and for my IDE I'm using Eclipse. This is the Java program that I just showed you: here is my mapper code, then here is my reducer code, and this is my driver code that I just explained to you in detail. You know that the starting point is the main method, and this is where my driver code resides, and here you can see that we have assigned the zeroth argument for the input path and the first argument for the output path. My class name here is WordCount, and this is the package where my class resides, that is, in.edureka.mapreduce, and I have imported the Hadoop jars that are required for this program. So these are the jars, and I've also exported this whole program, along with all the Hadoop dependencies, as a word count jar; this is the jar file which you can see over here. So that's it, let's go ahead and run this; for that I'll just open up my terminal. Now let's go ahead and create a directory in order to store my input and output: first I'll create one directory, and inside that I'll create two more directories, for input and output. For that you have to use the command hadoop fs, then -mkdir, which is for make directory, and let me call the directory wordcount. And now let us go ahead and create some subdirectories for input and output: we'll go ahead and I'll just add input over here, and similarly let us go ahead and create the output directory as well.
So I have created my directories; now what I have to do is put the data set, or the file that we're dealing with, into our input directory, so that Hadoop can fetch it from there and run the code. Let me show you where my file is: it's here in the home directory. So this is the file, the same file that we used in the example, with deer, beer, river and car; it's a simple paragraph, and we're going to perform the word count program on this text file, which is called test.txt. Let me clear the screen. So we're done with making our directories; our next step is to move this text file into our HDFS directory. For that we'll use the command hadoop fs -put, then the name of our file, which is test.txt, and our HDFS directory, which is called wordcount, and we want it in our input directory. So this will move it, and now what we have to do is run the jar file, in order to perform MapReduce on the test.txt file. For that we'll use this command: hadoop jar, then the name of my jar, which is the word count jar, and we also have to mention the name of the package, which you remember from the code, in.edureka.mapreduce, and also my class name where my main method is, so that the execution of this MapReduce program can get started from there. So the name of my class is WordCount, and I press enter.
so destroying this exception because if you remember in our driver code that we
have mentioned that our input directory is of the zeroth argument and the output
directory is of one as arguments but we haven’t mentioned it anywhere so we have
to go ahead and mention it so that Hadoop can fetch the file from our input
directory and finally store the output in mouths foot directory so now we’ll go
ahead and we’ll just mention the input and output directories so my input is in
word count flash input and my output was in word count slash output and now let
us run it so now I can see that the map produced execution is going on so you
can see that it has read some bytes and written some bytes so let us go ahead
and see the output. Let me show you my output file. For that I'll use the command hadoop fs -ls and then my directory, and you see over here that this is my output file. So let us go ahead and check what Hadoop has written onto this output file, or let us see the MapReduce result. For that I'll just use the cat command, so this is the command: hadoop fs -cat, my directory, and slash asterisk zero. And there it is: it counted all the words and it has given you the final result, bear four times, car three, deer two, and river three. So this is how Hadoop executes MapReduce, and this is how you can run different MapReduce programs in your system. This is just one simple example; you can go ahead and run different programs as well. So I hope that you all have understood this.
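To make the word count job we just ran concrete, here is a small Python sketch of the same map, shuffle, and reduce steps. This is only an illustration, not the tutorial's Java code, and the sample lines are made up to stand in for test dot txt:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values that belong to the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the grouped values to get the count per word
    return {word: sum(values) for word, values in groups.items()}

def word_count(lines):
    pairs = []
    for line in lines:  # each line stands in for an input split
        pairs.extend(map_phase(line))
    return reduce_phase(shuffle(pairs))

# Made-up sample lines standing in for test.txt
sample = ["bear river car bear", "deer car bear river", "bear car deer river"]
print(word_count(sample))  # {'bear': 4, 'river': 3, 'car': 3, 'deer': 2}
```

The shuffle step is what Hadoop performs for you between the map and reduce phases; the mapper and the reducer are the only parts you normally write yourself.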
We'll go ahead and move on to the next topic. Now let us take a look at the YARN components. YARN stands for Yet Another Resource Negotiator, which is nothing but MapReduce version 2. So let us take a look at the components: we have got the resource manager, the node manager, the app master, and the container. The resource manager here, again, is the main node in the processing department. The resource manager receives processing requests, like MapReduce jobs, and then it passes on the request to the node managers, and it monitors whether the MapReduce job is taking place correctly or not. The node manager over here is installed on every data node, so basically you can think of a node manager and a data node as lying on a single machine, and the node manager is responsible for the app master and the container. Now coming to containers: a container is nothing but a combination of CPU and RAM, and this is where the entire processing, or the MapReduce task, takes place. Then we have got the app master. An app master is assigned whenever the resource manager receives a request for a MapReduce job; only then is an app master launched, and it monitors whether the MapReduce job is going on fine, and it reports to and negotiates with the resource manager to ask for the resources which might be needed in order to perform that particular MapReduce job. So this is again a master/slave architecture, where the resource manager is the master and the node manager is the slave, which is responsible for looking after the app master and the container. So this is YARN. Now let us go ahead and take a look at
the entire MapReduce job workflow. So what happens: the client node submits a MapReduce job to the resource manager, and as you know, the resource manager is the master node, so this is where a job is submitted. Then the resource manager replies to the client node with an application ID, and the resource manager contacts a node manager and asks it to start a container. The node manager is responsible for launching an app master for each application, and the app master will negotiate for containers, that is, the data node environment where the process executes; then it will execute the specific application and monitor its progress. The app masters are nothing but daemons which reside on the data nodes and communicate with the containers for the execution of tasks on each data node. The app master will receive all the resources that are needed from the resource manager in order to complete that job, and it will start a container. So the app master launches a container, and when the container is launched we'll have a YarnChild, which will perform the actual MapReduce work, and finally we will get the output. So this is how the entire MapReduce job workflow takes place. And now let us understand what happens
behind the scenes when a MapReduce job is taking place. So this is our input block, and the data in the input block will be read by the map tasks. Each map has a circular memory buffer that it writes its output to, and the buffer is 100 MB by default, but the size of this buffer can be tuned by changing the mapreduce.task.io.sort.mb property. When the contents of the buffer reach a certain threshold size, by default when it fills up to 0.80, or let's say 80%, a background thread will start to spill the contents to disk. The map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Before spilling the contents to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to, and within each partition the background thread performs an in-memory sort by key. Each time the memory buffer reaches the spill threshold a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams, or spill files, to merge at once, and the default is ten. So now we'll have outputs from the different mapping functions, and finally all these outputs from the different maps are fetched and sent to the reducers for aggregation. You can see in this image over here that we have different intermediate results from different maps, and finally they are merged together and sent to the reducer in order to produce the final result. So this is how MapReduce works. I hope that you all understood this. Any questions? All right.
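The buffer-and-spill behaviour described above can also be sketched in plain Python. This is only an illustration: a toy threshold of four records stands in for 80% of the sort buffer, and the partitioning function is a deterministic stand-in for Hadoop's hash partitioner:

```python
import heapq

NUM_REDUCERS = 2
SPILL_THRESHOLD = 4  # toy stand-in for 80% of the 100 MB sort buffer

def partition(key):
    # Deterministic stand-in for Hadoop's hash partitioner
    return sum(map(ord, key)) % NUM_REDUCERS

def make_spill(buffer):
    # One "spill file": a sorted run of (key, value) pairs per partition
    spill = [[] for _ in range(NUM_REDUCERS)]
    for key, value in buffer:
        spill[partition(key)].append((key, value))
    return [sorted(run) for run in spill]

def map_with_spills(pairs):
    # Buffer the map output, and spill to "disk" whenever the buffer
    # reaches the threshold, just like the background thread does
    buffer, spills = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= SPILL_THRESHOLD:
            spills.append(make_spill(buffer))
            buffer = []
    if buffer:
        spills.append(make_spill(buffer))
    return spills

def merge_spills(spills):
    # Before the task finishes, merge all spill files into a single
    # partitioned, sorted output (what the sort factor limits in Hadoop)
    return [list(heapq.merge(*(s[p] for s in spills)))
            for p in range(NUM_REDUCERS)]

pairs = [("deer", 1), ("bear", 1), ("fox", 1), ("car", 1),
         ("car", 1), ("fox", 1), ("deer", 1)]
merged = merge_spills(map_with_spills(pairs))
print(merged[0])  # [('bear', 1), ('car', 1), ('car', 1), ('deer', 1), ('deer', 1)]
print(merged[1])  # [('fox', 1), ('fox', 1)]
```

Note that each reducer ends up with one sorted run, which is exactly what lets the reduce side aggregate each key's values in a single pass.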
So we'll move on and take a look at the YARN architecture. We have already gone through the components of YARN, so we already know that there is a resource manager, which is the master, and then we have got slave nodes, where a node manager is present on every slave node, and the node manager is responsible for the app master and the container. So we've got different node managers here, and what a node manager does is send the node status, that is, how each node is performing on a MapReduce job, as a report to the resource manager. When the resource manager receives a job request, or a MapReduce job request, from a client, what it does is ask a node manager to launch an app master. Now, there is only one app master for each application; it is launched only when the resource manager gets a request, or a MapReduce job, from the client, and it is terminated as soon as the MapReduce job is completed. The app master is responsible for collecting from the resource manager all the resources that are needed in order to perform that MapReduce job: the app master asks for all the resources that are needed, and the resource manager provides them through the app master. Finally, the app master is also responsible for launching a container, which is where the actual MapReduce job, or the MapReduce processing, will take place. So the entire YARN architecture is fairly simple, and I hope that you've understood this. And now let
us take a look at the Hadoop architecture by combining both of these two concepts together: the Hadoop Distributed File System and YARN. If you see HDFS and YARN together, we have got two master nodes here: the master node in the case of HDFS is the name node, and in YARN it is the resource manager. HDFS is only responsible for storing our big data. We also have the secondary name node here, which is responsible for checkpointing, and you already know that a checkpoint, again, is the process of combining the FsImage with the edit log. For actually storing the data we have got the data nodes, which are the worker nodes, and in the case of YARN our worker nodes are the node managers, which are responsible for processing the data, which is nothing but a MapReduce job. You can also see that a data node and a node manager basically reside on a single machine. So this is HDFS and YARN all together, and I hope that you have all understood HDFS and YARN. You all now know how data is stored in Hadoop and how it is processed in Hadoop. So now let us take a look at
how a Hadoop cluster actually looks. So this is how a Hadoop cluster looks: we have got different racks that contain different nodes, master and slave nodes all together, and these are nothing but different clusters. All these machines are interconnected, and within a particular rack they are connected with a switch; we have got the master node, that is the name node, the secondary name node, and different slave nodes. We can also combine small clusters together in order to obtain one big Hadoop cluster. So this is a very simple diagram that shows you what a Hadoop cluster looks like. Now let us see how you can launch different Hadoop clusters, or the different modes of a Hadoop cluster. Okay, we'll start from the bottom, with multi-node cluster mode. The previous image that I've just shown you is a multi-node cluster, so let me just go back and show it to you again. This is a Hadoop multi-node cluster: we have got multiple nodes over here, name nodes, which are master nodes, and worker nodes, on different machines. So this is a multi-node cluster. And then we have got pseudo-distributed mode, which means that all
the Hadoop daemons, the master daemon and the slave daemons, run on a single local machine. And then we have got standalone, or local, mode, which means that there are no daemons at all and everything runs in a single Java virtual machine. This is only suitable when you just want to try Hadoop out and see how Hadoop works, so it's only for that; it completely violates our concept of having a distributed file system, because nothing is distributed at all when you have only a single machine. In pseudo-distributed mode the difference is that you have virtualization inside: even though the hardware is the same, you can still have logical separations. But this is also not advisable for real use, since if that machine goes down, your entire Hadoop cluster, or your entire Hadoop setup, would be lost. So you can go ahead and set up your Hadoop cluster in pseudo-distributed mode when you want to learn Hadoop, when you want to see how the files get distributed and get first-hand experience with Hadoop; you can set up your Hadoop cluster on a single machine by logically partitioning it. But when you talk about production, you should always go with multi-node cluster mode. You should distribute the tasks, and that is exactly how you'll get the benefits out of big data: unless you distribute the tasks, unless all the tasks are performed in parallel by different machines, and unless you have a backup plan, that is, backup storage, a backup node, or a backup machine for processing when a single machine goes down, you won't get the proper benefits of using Hadoop. That's why, for production purposes, you should always go with multi-node cluster mode.
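As a side note on these modes: in practice, the difference between standalone and pseudo-distributed mode comes down to a couple of configuration properties. The sketch below is illustrative only; the property names are standard Hadoop configuration, but the port and the values are just common defaults:

```xml
<!-- core-site.xml: point Hadoop at a local HDFS daemon instead of the
     plain local filesystem that standalone mode uses by default -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: with only one machine there is nowhere else to
     replicate to, so the replication factor is dropped to 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```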
So this was all about Hadoop clusters. Now let us go ahead and see the Hadoop
ecosystem. So this is the Hadoop ecosystem, and it is nothing but a set of tools which you can use for performing big data analytics. Let's start with Flume and Sqoop, which are used for ingesting data into HDFS. Now, I already told you that data has been generated at a very high velocity, so in order to cope with that velocity we use tools like Flume and Sqoop to ingest the data into our processing system or our storage system, because it is getting generated at a very high rate. Flume and Sqoop act like a funnel, holding the data for some time and then ingesting it accordingly. Flume is used to ingest unstructured and semi-structured data, which is mostly social media data, and Sqoop is used to ingest structured data, like Excel sheets, Google Sheets, and so on. You already know what HDFS is: the distributed file system which is used for storing big data. We have also discussed YARN, which is nothing but Yet Another Resource Negotiator; this is meant for processing big data. Apart from that, we have got many other tools in our Hadoop ecosystem. We have got Hive. Now, Hive is used for analysis; it was developed by Facebook, and it uses Hive Query Language, which is very similar to SQL. So when Facebook developed Hive and wanted to start using it, they didn't have to hire new people, because they could already use the people who were experts in SQL, since it's very similar to that. Then we have got another tool for analytics, which is Pig. Pig is really powerful: one Pig command is almost equal to 20 lines of MapReduce code. Obviously, when you run that one-line Pig command, the compiler implicitly converts it into MapReduce code, but you only have to write one single Pig command and it will perform analytics on your data. Then there's Spark over here, which is used for near real-time processing, and for machine learning we've got two more tools, Spark MLlib and Mahout. Again, we've got tools like ZooKeeper and Ambari, which are used for management and coordination: Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters, and Oozie over here is a workflow scheduler system for managing Apache Hadoop jobs, and it is a very scalable, reliable, and extensible system. Then Apache Storm: this is used for real-time computation, it is free and open source, and with Storm it is very easy to reliably process unbounded streams of data. Then we've also got Kafka, which handles real-time data feeds, and we've got Solr and Lucene, which are used for searching and indexing. So these are the set of tools in the Hadoop ecosystem, and according to your needs you should select the best tools and come up with the best possible solution; you don't have to use all the tools at the same time. So this was the Hadoop ecosystem. Any questions or doubts?
All right, so now let us take a look at a use case to understand how we can use Hadoop for big data analytics in real life, and we'll understand it by taking up and analyzing an Olympic data set. So let us see what we're going to do with this data set and how this data set looks. We have an Olympic data set, and we're going to use a Hadoop tool which is known as Pig in order to make some analyses of this data set. Now, let me tell you a little bit about Pig before going ahead with this use case. Pig is a very powerful and very popular tool that has been widely used for big data analytics, and with Pig you can write complex data transformations without any knowledge of Java. You saw the earlier program that we wrote: that was fairly simple, just a small MapReduce program, but it had almost 70 to 80 lines of Java code, and if you're not good at Java it might be a little hard for you. Now you don't have to worry, because we have got Pig, and Pig uses its own language, which is known as Pig Latin. This is very similar to SQL, and it also has various built-in operators for joining, filtering, and sorting to process large sets of data. Also, let me tell you a very interesting fact: ten lines of Pig code is almost equal to 200 lines of MapReduce code. That is why Pig is so popular: it is very easy to learn, and it is very easy with Pig to deal with large data sets. So now we have got the Olympic data set. This is fairly small, but just for an example let me show you and tell you what we are going to do with this data set. These are the analyses that we're going to make on the Olympic data set: first, we're going to find the list of the top ten countries that have won the highest number of medals; then we're going to see the total number of gold medals won by each country; and we'll also find out which countries have won the most medals in a particular sport, which is swimming. So this is what we're going to find out, and now let us take a look at
our data set. So this is a brief description of my data set; I've got these fields in my data set. The first field is athlete, and this contains the name of the athlete; then we have got the age of the athlete; the country which an athlete belongs to; the year of the Olympics in which the athlete played; the closing date, which is the date when the closing ceremony was held for that particular Olympic year; the sport which an athlete is associated with; the number of gold medals won by him or her; the number of silver medals; the number of bronze medals; and the total medals won by a particular athlete. This is what our data set looks like: here is the athlete field, which contains the names of the athletes, like Michael Phelps and Natalie Coughlin; then the age of the athlete; the country, the United States; the year, 2008; the closing ceremony date; the sport, which is swimming; gold medals, eight; silver medals, zero; bronze, zero; total medals, eight. So this is how our data set looks, and we're going to perform some operations on this data set in order to derive some analyses and some insights using Pig. So let us go ahead and let me show you how to do that. So this is my
terminal, where I have got my Hadoop setup, and we're going to use Pig for this. Now, I have already loaded my data set into HDFS; let me show you where my data set actually lies, with hadoop fs -ls. These are my input and output directories, so let us go ahead, use Pig, and make this analysis. All my results will be stored over here, and I'll go ahead and show them to you once we perform all the operations. The first thing that we're doing: we're going to find the list of the top ten countries with the highest medals, so let me go ahead and open Pig. This is the shell for Pig. The first thing that we need to do is load the data set into Pig. For that I'm going to use a variable; I'm going to store the data set in this variable, and this is the command that I'm using, which is load, and then you have to mention the name of the directory, which is olympic slash input, and you also have to mention the name of your data set, which is olympics underscore data, and it is a CSV file. After that you have to write using PigStorage, and we're going to use a delimiter, backslash t. Now, I'll tell you why: if you remember, in our data set all of the fields are separated by a tab, and that's why we have used backslash t as our delimiter here. And make sure that after each line of Pig code you end it with a semicolon, just like you do in SQL. Now press Enter. Now let us go check out this variable,
olympic. For that you can use the command dump and the name of the variable. So my data set has been loaded; this is it. Here we have got all the fields mentioned: we have got the name of each player, the age, the country they belong to, the year of the Olympics, the closing ceremony date, the sport each of these athletes is associated with, the number of gold, silver, and bronze medals, and the total medals. So my entire data set has been loaded into the variable olympic. If you remember what we are going to do, we are going to find a list of the top ten countries with the highest medals, so we don't need all the fields here; we just need the fields where we have got the country name and the total medals. For that I will write one more line of code, but first I will clear the screen. I'm going to use another variable here; let us call it country final, and let me write the code: foreach olympic generate dollar two as country, dollar nine as total medals. Now, these numbers that you see, dollar two and dollar nine, are indexes, so let me go back to our data set and show you why I have mentioned two and nine here. This is our data set, and the indexing of the fields starts from zero: athlete is at the zeroth index, age is at one, country is at two, and total medals is at nine. We only need the country and the total medals, and that's why we've mentioned the indexes of the country field and the total medals field only. So now let us go ahead and execute this, and let us go check this variable. So this is another intermediate result, and you can see all the countries are present along with their total medals; you can see one Ukraine here and another one over here. So what we want to do now is group all the same countries together, and for that, again, I'm using a variable; we're calling it grouped. Then execute this command: group country final by country. Now let us check grouped. So now I can see all the same countries
are grouped together: we've got Trinidad and Tobago here, Serbia and Montenegro, the Czech Republic; all the countries are grouped together. This result is also intermediate. If you remember, in the previous MapReduce program we also got a similar intermediate value, and then what we did was count it and finally give the final result, and that is exactly what we're going to do now. Let me also tell you that each piece of Pig code that you run gets implicitly transformed, or implicitly translated, into MapReduce code only; so whatever is happening here, we did a similar thing in our previous code as well. Now we'll go ahead and count them. Let me use another variable to store the results; let me call it final result, and this is the command: foreach grouped generate group, and in order to count we're going to use an inbuilt function in Pig which is called COUNT, and here we're using our country final, with the count named f count. Now let us go check the final result. And there it is: South Korea has got 274 total medals, North Korea has 21, Venezuela has four. But if you see it right now, this is not in sorted order, and we want the top ten, so let us sort it so that we can have the highest medal winners at the top. Let me clear the screen. In order to sort it, I'm going
to use this variable: I'm going to store the sorted result in a variable called sort. Now I'm going to write the command: I want to order the final result by f count, and I want it in descending order. Let's go check sort. So now we have got all the countries in sorted order, and if we scroll up we can see that the United States has got the highest medals, then come Russia, Germany, Australia, and China. But I have got the list of all the countries with me, and I want only the top ten countries, so I will eliminate all the others and just select the top ten. For that let me use another variable to store the names of the top ten countries only; let me call it final count, and let me use limit sort ten. Now it will give me only the top ten values. Let us go check final count. So this is our final result: we have got the names of the top ten countries with the total number of medals each particular country won. So this is our final result; let's go ahead and store this result in our output directory. For that I'll use this command: store final count into the name of your directory, which is olympic slash output, and let me store it in a particular file; let me call it use case first. And it's a success, so the final result has been successfully stored in a file in my output directory. Similarly, we can go ahead and find out the answers to the other two questions that we already had.
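For comparison, the whole Pig pipeline we just ran, project, group, COUNT, order, limit, can be mirrored in a few lines of plain Python. The rows below are made-up samples, not the real Olympic file:

```python
from collections import Counter

# Made-up (country, total_medals) pairs, the two fields the Pig
# script projected with $2 and $9; not the real Olympic data
country_final = [("United States", 8), ("United States", 6), ("Russia", 2),
                 ("Australia", 3), ("United States", 1), ("Russia", 4)]

# GROUP country_final BY country, then COUNT(country_final) per group
grouped = Counter(country for country, _ in country_final)

# ORDER ... BY f_count DESC, then LIMIT ... 10
top10 = sorted(grouped.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)  # [('United States', 3), ('Russia', 2), ('Australia', 1)]
```

Note that COUNT tallies the number of rows in each group, here athlete entries per country; to add up the medal values themselves, a Pig script would use the SUM function instead.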
So the second one was to find the top ten countries that won the highest number of gold medals. Now, this is completely similar to the first one that we did; only, instead of selecting the field with total medals, we'll select the field for gold medals this time, and apart from that all the other steps will be the same. The gold medals are at the sixth index, so instead of writing nine we should write six in this case. The third one was to find out which countries have won the most medals in swimming, so let me go ahead and execute this one. This is also very, very similar; again, instead of just two fields we have to select three fields, because there is one more field, the sport field, involved in this one. Let me just go ahead and run the same commands. The first thing that we need to do is load our data set, which is the same, so we have loaded it. Now for the second step: instead of two fields we have to select three fields, so generate dollar two as country, this is fine, and we'll add another one; the sport is at the fifth index, so go ahead and mention that, dollar five as sport, and dollar nine as total medals. Since we want it for a particular sport, which is swimming, we'll filter out all the other sports and only take swimming into account. But first let me clear my screen. I'm using another variable; let me call it athlete filter, and I'm going to use another inbuilt function for that, which is known as FILTER: I'm going to filter country final by sport, and the sport is swimming. So now let us go ahead and check
athlete filter. So there we have it: we have got only the country name and the sport, swimming. Now again, this is another intermediate result; we want to group all the countries together again. For that let me use this variable called final group, and we'll use the inbuilt function called group: group athlete filter by country. Let's go and check out final group. So again we have grouped all the countries together, and now we'll go ahead and count them, using the same COUNT function that we used before. Let me use another variable over here; let me again call it final count, and make sure to have a space here. So: foreach final group generate group, and use the COUNT function, mentioning athlete filter. Now let's go ahead and check final count. Again, it's not sorted, and we want to see the top countries that always win medals in swimming, so again we'll go ahead and sort it: order final count by f count, so that the top country comes first, all sorted in descending order. Let me go ahead and check out sort. And there we have it. I know you guys already guessed it: it's obviously going to be the United States, and Michael Phelps won most of them. So yeah, we've got the United States at the top, and then we've got Australia, the Netherlands, Japan, China, Germany, and France. Now, if you want only the top five or top ten, you can do it in a similar way by using limit, but if you want to keep it this way, you can do that too. So now this is the final result that we want, and we're going to store it in our output directory. Again we'll use the same command: store sort into my output directory, which is again olympic slash output, and let me just store it in a file called use case three. Press Enter, and again this is successful.
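Again for comparison, the filter-then-group pipeline for swimming can be mirrored in plain Python; the triples below are made-up samples, not the real data set:

```python
from collections import Counter

# Made-up (country, sport, total_medals) triples, the fields the Pig
# script projected with $2, $5 and $9; not the real Olympic data
country_final = [("United States", "swimming", 8), ("Australia", "swimming", 2),
                 ("United States", "swimming", 1), ("Russia", "gymnastics", 3),
                 ("Netherlands", "swimming", 2), ("United States", "athletics", 2)]

# FILTER country_final BY sport == 'swimming'
athlete_filter = [row for row in country_final if row[1] == "swimming"]

# GROUP BY country, COUNT per group, then ORDER BY the count, descending
swim_counts = Counter(country for country, _, _ in athlete_filter)
result = sorted(swim_counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('United States', 2), ('Australia', 1), ('Netherlands', 1)]
```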
Now let's come out of Pig. This is my terminal, and now let us view the first result that we got, which we stored in our output directory. For that, type hadoop fs -cat, my output directory, which was in use case first, slash asterisk zero. And there is my result: it was successfully stored in my output directory, and there it is. So this is how you can use Pig in order to do analysis. Now, this is a very small data set, and these were very easy analyses that we made; you can perform some very complex ones as well using Pig, and you just have to write only a few lines of code. So I hope that you all have understood this use case. If you have any doubts you can ask me questions right now. So, do you have any questions? All right, thank you everyone
for attending this session. I hope that you've all learned about Hadoop, but if you have any queries or any doubts, kindly leave them in the comment section below. This video will be uploaded to your LMS, and I'll see you next time. Till then, happy learning! I hope you enjoyed listening to this video; please be kind enough to like it, and you can comment with any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to our Edureka channel to learn more. Happy learning!

Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka

95 thoughts on “Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka

  • October 5, 2017 at 6:21 am

    Nice presentation keep it up

  • October 13, 2017 at 12:26 pm

    Really good. Instead of single lengthy video, break it with multiple topics(playlist). Just a suggestion. 🙂

  • October 17, 2017 at 2:42 pm

    Thanks Reshma for your instructive and clear video tutorial. This is one of the best instructions on Hadoop for beginners.

  • October 20, 2017 at 6:30 pm

    Hey Reshma. That was very much informative.. How about the job openings outside for PIG scripting with basic knowledge as I am very much new to BIG data but little older with SQL knowledge.

  • October 22, 2017 at 2:05 am

    where can i find the software used in that video ? thanks

  • October 28, 2017 at 4:39 pm

    such a beautiful video guys iam from mechanical background and i was struggling to understand hadoop in my bigdata classroom sessions but just then i saw your video and i got a clear picture of what exactly is happening ……good work guys ….really looking forward to learn more courses from you guys in future……..

  • November 9, 2017 at 4:23 pm

    can you share your ppt? say thanks first.

  • November 23, 2017 at 4:38 am

    It is great job. Really helpful!

  • November 27, 2017 at 5:19 am

    Excellent piece of video on Hadoop who doesn't have any technical knowledge

  • November 29, 2017 at 6:35 am

    while creating pipeline if datanode 1 and datanode 4 work fine but datanode 6 is corrupted then will block A copy to data node 1&4 or client node will not receive any conformation because datanode 6 is corrupted

  • December 15, 2017 at 5:00 pm

    Clear, Crisp tutorial – Great job!!

  • December 20, 2017 at 5:24 am

    Last month i joined Hadoop classes daily a hour class you covered every thing in just 1-40 min just amazing please make more practical videos ,

  • December 20, 2017 at 6:44 pm

    Can anyone explain this to me, please? It would be very appriciated! Each datanode has a limited size (128MB), so when the file gets distributed at the beginning, there will be fully filled-up datanodes. How can the replication factor come to the picture?

  • December 23, 2017 at 4:56 am

    I didn't understand pipeline for copying. If we copy the block A three times which are available in three different racks. It is just a duplicate right?

  • December 24, 2017 at 8:21 am

    I was totally new to hadoop before watching this video. Now i have pretty much clarity. Thank you Ma'am, really very nice explaination.

  • December 29, 2017 at 10:48 am

    This is an excellent overview. It's very rare when you comes across someone that knows the technology so well that thy can break it down to the simplest level. Congratulations, I hope to learn much more from you if this is an indication of the standard of your training.

  • January 6, 2018 at 1:19 pm

    all the 47 tutorial videos which u upload is it enough to being master in hadoop ?? or it's compulsary to join your tutorial class ??

  • January 17, 2018 at 10:57 am

    yours tutorials are very nice i understand everything.. but i want that csv file u used in the tutorials

  • January 23, 2018 at 1:42 pm

    I have a question mam…And my question is you told that mapReduce code is long and little bit hard but pig is easy to learn so can I leave mapReduce topic and learn pig is it good? or I have to learn both mapReduce and pig? please mam help me

  • January 24, 2018 at 3:21 pm

    why Hadoop is split into block

  • January 31, 2018 at 7:40 am

    Very interesting.

  • January 31, 2018 at 8:57 am

    Hi Reshma it was beautifully explained and easy to understand.

    I'm eagerly waiting to c more of these Thank You

  • January 31, 2018 at 11:46 pm


  • February 4, 2018 at 5:07 pm

    Great explanations with very good examples.

  • February 6, 2018 at 5:48 pm

    Very helpful…

  • February 7, 2018 at 6:14 am

    awesome.. thanks..

  • February 10, 2018 at 2:46 pm

    can i get the ppt of your presentation madam

  • February 17, 2018 at 10:03 am

    Thank u so much mam….Nice explanation👍… voice clarity is awesome…This is the exact video what I was looking for

  • March 11, 2018 at 10:20 pm

    This tutor is a good educator! She makes it very simple to understand. I hope to find more of her tutorials.

  • March 12, 2018 at 1:48 am

    Each data blocks are replicated thrice and are distributed among all data node. Will it not increase the data size? As we will also need storage to store the 3 times of actual data.

  • March 13, 2018 at 6:40 am

    This video has been really useful for me to learn about Big Data. The examples are great to understand each part of Hadoop. Congratulations!

  • March 14, 2018 at 7:49 am

    Excellent, ma'am... thanks a lot...

  • March 23, 2018 at 9:34 am

    Thank you for this tutorial

  • March 23, 2018 at 8:05 pm

    Excellent video, explained well

  • March 24, 2018 at 7:41 pm

    Awesome video! Thanks 🙂

  • March 28, 2018 at 7:08 pm

    Thank you for the video! Great teacher.

  • March 30, 2018 at 6:33 am

    Thanks for using good english! Normally I can't understand Indian tutorials due to heavy accents. Your english is excellent!

  • April 4, 2018 at 7:22 am

    Very Very Helpful tutorial. Thanks for making this tutorial.

  • April 11, 2018 at 1:21 pm

    Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Hadoop Training and Certification Curriculum, Visit our Website: http://bit.ly/2KqiXsG Use code "YOUTUBE20" to get Flat 20% off on this training.

  • April 20, 2018 at 7:45 pm

    Very clear and concise. Thank you.

  • April 22, 2018 at 6:47 pm

    Very informative. Thank you.

  • May 1, 2018 at 6:53 am

    Thanks this was super helpful and your explanations are very clear.

  • May 8, 2018 at 8:06 am

    Are the videos in the playlist in the right order to watch?

  • May 11, 2018 at 12:36 am

    This is by far the best presentation from Edureka in my opinion. The lady presenter has perfectly understandable English, smooth narration, and a wonderful Pig section presentation. Thank you so much!!

  • May 12, 2018 at 7:27 am

    Hi, at 56:36 can you explain what the two blank spaces you are talking about are? Thank you.
    By the way, very good presentation.

  • May 12, 2018 at 11:20 pm

    Some people are just BORN TO TEACH! This tutor is one of them. Great job!

  • May 13, 2018 at 7:27 am

    I am learning Hadoop from the 53 videos presented by Edureka. What is the certification procedure after completing these 53 videos?

  • May 27, 2018 at 10:26 am

    Awesome .. A very good explanation..

  • June 4, 2018 at 10:11 am

    Thanks for the tutorial

  • June 5, 2018 at 7:48 am

    Hello sir

  • June 5, 2018 at 10:11 am

    If commodity hardware has a higher chance of failure, what's the reason for using it?

  • June 6, 2018 at 1:52 pm

    Is it necessary to learn Java, or is Python OK for it?

  • June 10, 2018 at 10:27 pm

    Nice tutorial. The examples taken are very easy to understand and meaningful.

  • June 11, 2018 at 2:54 am

    very good ~~  very helpful !!

  • June 15, 2018 at 10:03 am

    Even though the beginning is a bit slow, the rest of the video makes the patience pay off 🙂 Very good video! Thanks!

  • June 19, 2018 at 11:00 pm

    Hi. It was a really good session. Please let me know how the tools connect with the HDFS server.

  • June 21, 2018 at 7:11 am

    Hi Edureka,
    The explanation of MapReduce is awesome. I have one question: if most SQL-like operations can be done in Pig, then where will we use Hive, and what are the use cases of Hive?

  • June 24, 2018 at 7:33 am

    Awesome videos .. I have learned so many things..

  • July 2, 2018 at 11:45 am

    Excellent, ma'am. Thank you very much.

  • July 10, 2018 at 11:42 am

    this is complicated

  • July 10, 2018 at 11:57 am

    The Java reducer code is like Chinese to me; I don't know Java.

  • July 10, 2018 at 6:58 pm

    One of the finest and simplest presentations I have ever seen in my life.

  • July 14, 2018 at 11:51 am

    Thank you so much for uploading such a video. Your content is super powerful, and the best part is it seems to be very simple and easy to learn.

  • August 11, 2018 at 3:22 pm

    Well explained 🙂👏

  • August 15, 2018 at 9:37 pm

    Great tutoring… for Hadoop..

  • August 21, 2018 at 9:44 am

    Very nice explanation... great, Neha.

  • August 24, 2018 at 6:17 pm

    Brilliant tutorial, very informative and well explained. Thanks very much.

  • August 29, 2018 at 1:42 pm

    Wonderful explanation and easy to understand. Thanks!! Will keep watching this space for more.

  • September 3, 2018 at 8:37 am

    Love # respect from PAKISTAN. Great teacher! Informative lectures! Will share this with my other data scientists! #great_Indians <3

  • September 14, 2018 at 4:55 am

    Hi Guys, Thank you for being so generous to share such an informative and quality training stuff openly. Presenter is perfect. All the best team for all your future endeavors!

  • October 7, 2018 at 2:36 am

    Very good tutorial,very clear explanations..

  • October 7, 2018 at 2:38 am

    How much data can be stored on each node? Is there any limitation?

  • October 17, 2018 at 7:45 pm

    Is Java mandatory in big data? If so, how much Java knowledge is needed for performing HDFS tasks on the job? Also, are there any other programming languages we should have knowledge of?

  • October 26, 2018 at 1:30 am

    Is there any limit on the data stored on each DataNode/slave node? Or how many blocks of data can be stored on a DataNode?

  • October 26, 2018 at 1:38 am

    Do we need to know Java to do MapReduce?

  • October 26, 2018 at 8:39 am

    I have a question about master and slave nodes. What if Bob is on leave and John does not have time to deal with project C? Even if he takes it as a backup, the manager still cannot complete project C on time. Or am I missing something? Thanks.

  • October 28, 2018 at 1:38 pm

    What an excellent presentation!

  • November 3, 2018 at 10:24 am

    very good presentation

  • December 9, 2018 at 6:46 am

    A nice and good tutorial to understand the basic concepts of Hadoop and its ecosystem.

  • January 2, 2019 at 6:42 am

    How can I register for this specific instructor's course? What is her name? And how much per course?

  • January 20, 2019 at 5:42 pm

    How do I learn programming in Pig?

  • January 27, 2019 at 3:53 pm

    This video was very helpful, I'm totally new and I learnt a lot very quickly.

  • February 3, 2019 at 5:36 pm

    Wow! The way you gave the first 20 min intro was awesome. Even a 12 year old can understand this

  • February 16, 2019 at 5:52 am

    Thanks Reshma for your instructive and clear video tutorial. I joined the Edureka Hadoop Big Data course last month. I want to learn Pig programming end to end [Pig scripting]. Can you please advise where I can learn more Pig scripting concepts?

  • February 18, 2019 at 5:28 am

    Where does the ResourceManager actually reside?

  • April 26, 2019 at 10:22 am

    Very helpful and crisp content..

  • May 20, 2019 at 1:45 pm

    Can we perform replication at the NameNode level?

  • June 23, 2019 at 1:19 pm

    Nice work!

  • July 24, 2019 at 7:44 pm

    Nice explanation

