Understand Hadoop and Its Ecosystem
The main purpose of writing this content is to provide the basic understanding about Hadoop. But, before starting the introduction of Hadoop, let me ask some basic questions which arise in my mind and may also arise in your mind when you heard the word Hadoop, such as:
- Why Hadoop?
- Is Hadoop the only solution for Big Data? What are alternatives?
- What are the pros and cons of Hadoop?
Well, we will cover these entire questions but in the next part of this article. In this article, I am sharing the basic parts of Hadoop Ecosystem. Why Hadoop is so much popular, the power of Hadoop and some basic knowledge of each Hadoop components. Let’s begin.
Hadoop is not like another technology where you can learn or offer services, rather than this Hadoop is a platform where you can build splendid applications. From the rise of the Hadoop there are dozens of software communities have been developing modules which can address variety of problem specs and meeting different needs.
In this article we are going to get brief introduction about the world of Hadoop and its Ecosystem. With Hadoop Ecosystem there are tons of commercial as well as open source products which are widely used to make Hadoop laymen-accessible and more usable. Here we are going to discuss about some specific components which gives core functionality and speed to Hadoop.
HDFS – Hadoop Distributed File System
When file size grows from GBs to TBs, we need some concrete system who can manage this entire file. As we all know, that database can be distributed over multiple servers, as the same way when multiple file size increased, a single machine cant process that files, we need distributed file system (who manage storage across network). HDFS is specially designed to store huge amount of file size where operations are write once and read multiple times are done. For Low Latency data access, HDFS is not a good option, as there are lots of small file included in operation for the update purpose.
Image Courtesy: yourstory.com
Map Reduce is uses to process highly distributable issues where so many computers are collectively used (known as cluster). In terms of programming, there are two functions which are most common in Map Reduce.
In map step, Master computer or node takes input (Large data set to process) and divide it into smaller parts and distribute it on other worker nodes (other machine/Hardware). All worker nodes solve their own small problem and give answer to the master node.
In Reduce step, Master node combines all answers coming from worker node and forms it in some form of output which is answer of our big distributed problem.
If you have gone through the limitations or disadvantages of Hadoop you might found that to work with Hadoop is not easy for end users especially who are not familiar with Mapreduce framework. Hive specially developed for People who have strong control on SQL Queries, Hive provides SQL like language called HiveQL.
The main building blocks of Hive are –
- Metastore – To store metadata about columns, partition and system catalogue.
- Driver – To manage the lifecycle of a HiveQL statement
- Query Compiler – To compiles HiveQL into a directed acyclic graph.
- Execution Engine – To execute the tasks in proper order which are produced by the compiler.
- HiveServer – To provide a Thrift interface and a JDBC / ODBC server.
When all these components are merged, it makes the Hadoop very user friendly.
As said earlier, HDFS (Hadoop Distributed File System) works on write once and read many times pattern, but this isn’t a case always. We may require real time read/write random access for huge dataset; this is where HBase comes into the picture. HBase is built on top of HDFS and distributed on column-oriented database. HBase is not relational and doesn’t support SQL, but it’s specially designed to perform which RDBMS can’t such as to host very large dataset over a cluster.
A success of big data is depends on how fast and efficiently you can convert a vast amount of information to some actionable information. Whether it is analyzing all log (visited web pages) of Cyberoam for particular company, like Azilen Technologies, to know which site is most frequently opened in Azilen or if you want to proceed thousands of personal email messages and to generate analytics and then to organize and extract information, Mahot is being used here.
If you have seen the English season “Person of Interest” you may found that in this serial Person builds a machine to collect the data of people of New York City to identify terrorists and machine separates these people in two sections relevant and irrelevant. Relevant people are people who can be threat to national security and others are irrelevant. As the Same way Mahout Recommender Engines tries to identify unknown items from their area of interest and distributes in more categorized way.
To understand Sqoop, let’s say we have released one version of software and now it’s in production and you are implementing some new features and then releasing a new version which requires migration from old data to new data. In the same way, even if you love the new concepts of big data, your existing data is still in SQL over distributed server. Now, how to convert/migrate these data in such form that it can be useful to Big Data concepts. This is the scenario where Sqoop comes in the picture.
To load bulk data from production system and to access it in mapreduce application can be a challenging task. Transferring data using SQL or other technology scripts is inefficient and time-consuming. Sqoop is similarly to a SQL Server Integration Service which allows easy import and export of data from structured data such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop also sliced up data into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.
To analyze and querying a huge data set is not easy with any ordinary language. When we are talking about analyzing Pera bytes of data, it requires extensive level of parallel mechanism to analyze data.
Pig is used to analyze and to fire query on large data set that consists high level language which is for expressing data analysis programs, coupled with infrastructure for evaluating these. Pig Latin has main three key properties:
- Optimization opportunities
- Ease of programming
In real world there are many common problems which can’t be solved by the existing components like if I have well defined structured database and ask help to some of software engineer then each one can implement their logic and do analytics for me. But, data is not always in structured manner. Let’s take one example, if you assume Google is taking logs. I assume that Google has very strategic and good way to do this but this is only an assumption that on each day every keyword typed in Google search box are written in file. Now, do analytics for Google. I guess you are getting my points. All logs, whether they are event based or specific action based, are most difficult to analyze. these data that is where I may present Flume.
Flume is a utility for harvesting, aggregating and moving huge amounts of log data or text files in and out of Hadoop. Input of flume can be anything like Avro (another Hadoop component), files, system logs, HDFS, HBase. Flume itself has a query processing engine, so it’s easy to transform each new batch of data before it is shuttled to the intended sink.
From All above discussed components just few are well known components which are developed by famous software companies and each one is designed for very specific purpose so, for Hadoop I can say that it’s not single person or company’s idea to build it. With Hadoop and its ecosystem most interesting thing I found is names of each component. You can also see that each component has unique name and every name has its own story. Hadoop is powerful because it is extensible and it is easy to integrate with any component.
Power of Hadoop
Well, Hadoop is powerful yet widely used by the big brands in the real world. In this section I am going to share some case studies which tell their stories itself.
Case Study – 1: Last.fm
Last.fm is personalized radio which is having 40M unique users and 500M different page view each month. Every page view leads to at least one log line and different actions leads to multiple logs. Site is doing analytics of this vast data set and they find site states, reporting charts via Hadoop.
Case Study – 2: Our favorite Facebook
To manage large size of Facebook users, to determine application quality, to Generate statistics about site usage, Facebook is using multiple Hadoop Clusters. With use of Hadoop Cluster, Facebook is loading 250 GB every day and have hundreds of jobs running each day on these data. An amazingly large fraction of Facebook engineers have run Hadoop jobs at some point.
Case Study – 3: Google itself
Google is always looking for better & faster sorting algorithm. Google was able to sort 1TB size of files stored on Google file system in 68 Seconds with use of Hadoop MapReduce Programs. It’s a Google so they are not stopped with 1TB files they have experiment to sort 1PB data in (1000 TB) 6 hours and 2 minutes using 4000 machines.
What’s Next in Azilen for Hadoop?
In Azilen, we are planning to make pilot project for Hadoop. As, we have log of following web activity of users like which site is browsed by user and how much time is spent on that site. With help of these logs we will be performing 3 steps.
- Analyze Logs and migrate it into more structured formats.
- Perform analytics with use of Hadoop Ecosystem’s components
- Generate Reports to give rank on popularity of websites in Azilen
- Reports will be displayed in Liferay.
We are doing development on these as a proof of concept and to prove our capabilities in Hadoop. We will be publishing our findings and results in near future.