Understand Hadoop and Its Ecosystem

The main purpose of writing this content is to provide the basic understanding about Hadoop. But, before starting the introduction of Hadoop, let me ask some basic questions which arise in my mind and may also arise in your mind when you heard the word Hadoop, such as:

Why Hadoop?
Is Hadoop the only solution for Big Data? What are alternatives?
What are the pros and cons of Hadoop?

Well, we will cover these entire questions but in the next part of this article. In this article, I am sharing the basic parts of Hadoop Ecosystem. Why Hadoop is so much popular, the power of Hadoop and some basic knowledge of each Hadoop components. Let’s begin.

Hadoop Ecosystem

Hadoop is not like other technology where you can learn or offer services, rather this Hadoop is a platform where you can build splendid applications. Since the rise of Hadoop, there are dozens of software communities have been developing modules that can address a variety of problem specs and meet different needs.

In this article, we are going to get a brief introduction to the world of Hadoop and its Ecosystem. With the Hadoop Ecosystem, there are tons of commercial as well as open-source products that are widely used to make Hadoop laymen accessible and more usable. Here we are going to discuss some specific components that give core functionality and speed to Hadoop.

HDFS – Hadoop Distributed File System

When file size grows from GBs to TBs, we need some concrete system that can manage this entire file. As we all know, a database can be distributed over multiple servers, in the same way when multiple file sizes increase, a single machine can’t process those files, we need a distributed file system (that manages storage across the network). HDFS is specially designed to store huge amounts of file sizes where operations are written once and read multiple times. For Latency data access, HDFS is not a good option, as there are lots of small files included in operation for the update purpose.

Map Reduce

Map Reduce is used to process highly distributable issues where so many computers are collectively used (known as a cluster). In terms of programming, two functions are most common in Map Reduce.

1) Map
2) Reduce

In the map step, the Master computer or node takes input (Large data set to process) divides it into smaller parts, and distributes it to other worker nodes (other machine/Hardware). All worker nodes solve their small problem and answer the master node.

In the Reduce step, the Master node combines all answers coming from the worker node and forms some form of output which is the answer to our big distributed problem.

Hive

If you have gone through the limitations or disadvantages of Hadoop you might find that working with Hadoop is not easy for end users especially those who are not familiar with the MapReduce framework. Hive was specially developed for People who have strong control over SQL Queries, Hive provides an SQL like language called HiveQL.

The main building blocks of Hive are –

Metastore – To store metadata about columns, partitions, and system catalog.
Driver – To manage the lifecycle of a HiveQL statement
Query Compiler – To compile HiveQL into a directed acyclic graph.
Execution Engine – To execute the tasks in proper order which are produced by the compiler.
HiveServer – To provide a Thrift interface and a JDBC / ODBC server.

When all these components are merged, it makes Hadoop very user-friendly.

HBase

As said earlier, HDFS (Hadoop Distributed File System) works on write once and read many times pattern, but this isn’t the case always. We may require real-time read/write random access for huge datasets; this is where HBase comes into the picture. HBase is built on top of HDFS and distributed on column-oriented database. HBase is not relational and doesn’t support SQL, but it’s specially designed to perform what RDBMS can’t such as to host a very large dataset over a cluster.

Mahout

The success of big data depends on how fast and efficiently you can convert a vast amount of information into actionable information. Whether it is analyzing all logs (visited web pages) of Cyberoam for a particular company, like Azilen Technologies, to know which site is most frequently opened in Azilen or if you want to proceed with thousands of personal email messages and to generate analytics and then organize and extract information, Mahot is being used here.

If you have seen the English season “Person of Interest” you may found that in this serial Person builds a machine to collect the data of people of New York City to identify terrorists and the machine separates these people into two sections relevant and irrelevant. Relevant people are people who can be a threat to national security and others are irrelevant. In the same way, Mahout Recommender Engines tries to identify unknown items from their area of interest and distributes them in a more categorized way.

Sqoop

To understand Sqoop, let’s say we have released one version of software and now it’s in production and you are implementing some new features and then releasing a new version which requires migration from old data to new data. In the same way, even if you love the new concepts of big data, your existing data is still in SQL over the distributed server. Now, how to convert/migrate these data in such a form that it can be useful to Big Data concepts. This is the scenario where Sqoop comes into the picture.

Loading bulk data from the production system and accessing it in the Mapreduce application can be a challenging task. Transferring data using SQL or other technology scripts is inefficient and time-consuming. Sqoop is similar to a SQL Server Integration Service which allows easy import and export of data from structured data such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop also sliced up data into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.

Pig

To analyze and querying a huge data set is not easy with any ordinary language. When we are talking about analyzing Pera bytes of data, it requires an extensive level of parallel mechanism to analyze data.

Pig is used to analyze and fire queries on large data sets that consist of high-level language which is for expressing data analysis programs, coupled with infrastructure for evaluating these. Pig Latin has main three key properties:

Extensibility
Optimization opportunities
Ease of programming

Flume

In the real world, many common problems can’t be solved by the existing components like if I have well defined structured database and ask for help from some software engineer then each one can implement their logic and do analytics for me. But, data is not always in a structured manner. Let’s take one example if you assume Google is taking logs. I assume that Google has a very strategic and good way to do this but this is only an assumption that each day every keyword typed in the Google search box is written in the file. Now, do analytics for Google. I guess you are getting my points. All logs, whether they are event-based or specific action-based, are most difficult to analyze. these data that is where I may present Flume.

Flume is a utility for harvesting, aggregating, and moving huge amounts of log data or text files in and out of Hadoop. The input of flume can be anything like Avro (another Hadoop component), files, system logs, HDFS, and HBase. Flume itself has a query processing engine, so it’s easy to transform each new batch of data before it is shuttled to the intended sink.

So far,

From All the above-discussed components just a few are well-known components that are developed by famous software companies and each one is designed for a very specific purpose so, for Hadoop, I can say that it’s not a single person or company idea to build it. With Hadoop and its ecosystem most interesting thing I found is the names of each component. You can also see that each component has a unique name and every name has its own story. Hadoop is powerful because it is extensible and it is easy to integrate with any component.

Power of Hadoop

Well, Hadoop is powerful yet widely used by big brands in the real world. In this section, I am going to share some case studies that tell their stories.

Case Study – 1: Last. fm

Last. fm is personalized radio which has 40M unique users and 500M different page views each month. Every page view leads to at least one log line and different actions lead to multiple logs. The site is doing analytics of this vast data set and they find site states, and reporting charts via Hadoop.

Case Study – 2: Our favorite Facebook

To manage a large size of Facebook users, determine application quality, and Generate statistics about site usage, Facebook is using multiple Hadoop Clusters. With the use of the Hadoop Cluster, Facebook is loading 250 GB every day and has hundreds of jobs running each day on this data. An amazingly large fraction of Facebook engineers have run Hadoop jobs at some point.

Case Study – 3: Google itself

Google is always looking for a better & faster sorting algorithm. Google was able to sort 1TB size of files stored on Google file system in 68 Seconds with use of Hadoop MapReduce Programs. It’s a Google so they are not stopped with 1TB files they have an experiment to sort 1PB data in (1000 TB) 6 hours and 2 minutes using 4000 machines.

What’s Next in Azilen for Hadoop?

In Azilen, we are planning to make a pilot project for Hadoop. , we have a log of the following web activity of users like which site is browsed by the user and how much time is spent on that site. With the help of these logs, we will be performing 3 steps.

Analyze Logs and migrate them into more structured formats.
Perform analytics with the use of Hadoop Ecosystem’s components
Generate Reports to give rank on the popularity of websites in Azilen
Reports will be displayed in Liferay.

We are doing development on these as a proof of concept and to prove our capabilities in Hadoop. We will be publishing our findings and results shortly.