What is Apache Hadoop? Website Brand Review

Website Brand Review of Apache Hadoop

We’ve all heard of Apache® Hadoop® – well, at least heard of Hadoop, and by now you should realize it’s an Apache project! But when was the last time you took a critical eye to the actual Apache Hadoop project’s homepage?

Here’s my quick review of the Apache Hadoop project, told purely from the point of view of a new user finding the project website.

What Is Apache Hadoop?

“Apache Hadoop (is) a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”

“Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

No, Really, What Is Apache Hadoop For?

Hadoop is essentially a platform that provides the underlying pieces for parallel processing of large, distributed datasets. In particular, Hadoop lets you run one function (the Map) over many pieces of data on many different machines at once, and then combine the partial results into a final answer (the Reduce), using a variety of common servers or cloud/container clusters. Hadoop is made up of three key services integrated together:

HDFS – Hadoop Distributed File System
A file system for massive amounts of data that provides high-throughput access and distributes and scales across the many machines of a cluster. This lets you get at your data from (almost) wherever it is, in a consistent and performant way.
Hadoop YARN
The YARN framework provides compute and resource management features in a distributed and highly available manner. YARN is the orchestrator that keeps all the parts – data and processing – talking to each other across a distributed cluster.
Hadoop MapReduce
Hadoop provides the core implementation of the MapReduce algorithm: a map function analyzes each small piece of data, and a reduce function then combines those intermediate results, with both applied across a giant set of data in an efficient and parallel way. MapReduce takes the analysis and farms it out across a whole cluster working at once, each machine running the same map on a different bit of data. At the end, MapReduce pulls all the answers together, and to some degree, you can make the process faster by simply having a larger cluster, with no changes to your programming needed.
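
To make the map-then-reduce idea concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API at all; it only imitates the three phases (map, group by key, reduce) on one machine with the classic word-count example. The function names (map_words, shuffle, reduce_counts, word_count) and the sample text are my own hypothetical choices for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: turn one line of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for what a framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: combine all values seen for one key into one result."""
    return word, sum(counts)


def word_count(lines):
    # On a real cluster, different lines (or blocks of lines) could be
    # mapped on different machines; here everything runs in one process.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data moves to clusters"]
    print(word_count(sample))
```

Because each call to map_words only looks at its own line, the map work could in principle be split across as many machines as you have, which is the property that lets a larger cluster speed things up without changing the program.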

“Hadoop” has clearly become one of the biggest terms people use to refer to “big data” processing as a whole. Thus while we at the ASF think of “Hadoop” as our software project, the usage in public varies widely from meaning “Apache Hadoop”, to essentially meaning “Our Company’s Solution using Hadoop”, to meaning “how you process big data” generically.

New User Website Perceptions

What does a new user see “above the fold” when first coming to the Apache Hadoop project homepage? For their first impression: is it easy to find things, can they quickly understand how to get the software and contribute to the project, and is the design appealing and easy to follow?

The homepage is text heavy after the excellent elephant logo, and provides a very basic explanation of what Hadoop does, quickly moving into a list of Hadoop modules, and then related Apache projects. At this point in the big data lifecycle, many “users” will be programming to a project model that uses Hadoop, not necessarily to Hadoop APIs directly, so the listing of a variety of higher-level projects is important. The overall design suffers from “2010 era” symptoms due to its being built from Apache Forrest.

The main product / API documentation is built by Apache Maven, and features a different, slightly cleaner and more modern, but still text-heavy design. Separately, some content is still hosted on the MoinMoin wiki system, which has a very clunky appearance today.

The homepage features a direct “Getting Started” list, which points to the full documentation, which points to a pair of pages stepping through how to install and run basic examples of simple clusters running data analysis. The steps provide the technical details, but a thorough reading of the entire cluster setup script seems needed to even understand the basics. To some degree this is understandable: fully installing and understanding Hadoop requires knowing about networks, system setup, databases, and managing clusters, as well as managing all the analysis and database access your actual application will need.

There are no obvious “How To Contribute” guides, and while the overall documentation covers the basics of getting, configuring, and using Hadoop, the style tends to be dry and official. Each part of the documentation seems complete and goes into detail, but the tying together of all the different parts isn’t as obvious – and the entire documentation website includes almost a hundred major topic areas. It feels like the information is focused on highly technical enterprise software and data engineers who already have a familiarity with clusters, MapReduce theory, and distributed systems.

The issues and mailing list pages are sparse: they list the almost 20 mailing lists and 4 JIRA repositories, but don’t obviously offer “how to format patches” or “FAQ” listings. It’s clear there is detailed documentation about everything Hadoop has to offer, but as a non-data scientist, it’s not always obvious where to start or how I could help contribute to the project.

The unassuming Search box on the Hadoop homepage goes to an external search-hadoop site which offers a detailed Lucene search syntax, and is based on results from a wide variety of projects sitting atop Hadoop. This underscores the fact that using Hadoop directly is not necessarily that common; many users actually use products sitting atop the Hadoop framework at this point.

Apache Branding Requirements

Apache projects are expected to manage their own affairs, including making all technical and content decisions for their code and websites. However, to ensure a small modicum of consistency – and to ensure users know that an Apache project is hosted at the ASF – there are a few requirements all Apache projects must include in their projectname.apache.org websites (or wikis, etc.):

  • Apache Hadoop is used on the homepage and announcements, but is usually shortened in other places.
  • Website navigation links to ASF pages are included in the site’s navigation system, except that the “Security” link is missing.
  • Logo does not include TM; footers include a full trademark attribution on the homepage (Forrest generated site) but not on the Maven-generated portions of the documentation site.
  • DOAP file exists, but only with basic info and no releases listed.

SEO / Search Hits / Related Sites

Well, SEO is far outside of our scope (and debatable in usefulness anyway), but it’s interesting to see: how does a new user find the Apache Hadoop homepage when they were searching?

Searching for “Hadoop”:

Top hits vary: the project homepage or Wikipedia, industry vendor pages, or a plethora of sponsored ad content from software and services vendors in the big data space.

Searching for “Hadoop software”:

Similar to just searching for Hadoop; the term has clearly gone mainstream in a number of ways.

Social Media Presence

The Hadoop project has some social media presence, but it’s not featured obviously anywhere on the project website.

  • https://twitter.com/Hadoop appears to be the official feed, but has not been updated for a while. There are a variety of other *Hadoop* accounts, including @TwitterHadoop, which is a verified account run by a Twitter engineering team.
  • There are hundreds of Hadoop groups and companies listed on LinkedIn, but it’s not obvious if any are specifically run by the project; most are apparently run by vendors or user groups.
  • Similarly with Google+: lots of groups of varying types and names.
  • http://stackoverflow.com/questions/tagged/hadoop is very active, although the about note for the tag seems outdated (if thorough). There are several other active tags, including hadoop2 and hadoop-streaming.

What Do You Think Apache Hadoop Is?

So, what do you think? Is Hadoop still the major way you process big data, or have you moved to other tools? Does your organization use the original Apache Hadoop product, or have you moved on to using a vendor’s version or the Apache Bigtop release of Hadoop + other Apache projects? Or do you do all your work in Apache Spark or another framework atop a Hadoop cluster?

Note: I’m writing here as an individual, not wearing any Apache hat. I hope this is useful both to new users and to the Apache Hadoop community, not necessarily a call to change anything. I haven’t used Hadoop for any real deployments myself, so please do comment with corrections to anything I’ve messed up above!
