What is Apache Hadoop? Website Brand Review

Website Brand Review of Apache Hadoop

We’ve all heard of Apache® Hadoop® – well, at least heard of Hadoop, and by now you should realize it’s an Apache project! But when was the last time you took a critical eye to the actual Apache Hadoop project’s homepage?

Here’s my quick review of the Apache Hadoop project, told purely from the point of view of a new user finding the project website.

What Is Apache Hadoop?

“Apache Hadoop (is) a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”

“Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

Even better than Hadoop!

You know what’s even better than using Hadoop? Using Apache Hadoop!

Even better is Apache Ambari to manage your Apache Cassandra data store through Apache Hive with Apache Pig to make it simpler to write Apache Spark compute flows… Or, if you want it assembled for you, just grab the latest Apache BigTop, which already includes a bunch of Apache Hadoop related packages all together.

How can we do a better job of getting at least a single “Apache Hadoop” into some of the many media stories about Hadoop these days? It’s great that all these vendors are making great technology and projects that power big data, but with all their success and fancy marketing campaigns, you’d think we could get just a tiny bit of credit in the popular press for the actual committers on the core Apache Hadoop project itself. Or for any of the other Apache project technologies that these vendors, other software companies – and just about every other company too – rely on every day to help make their websites work.

Would it hurt marketers and journalists and bloggers to throw in just one extra “Apache” before talking about the many free Apache software products that help power more than half the internet?

The ASF and Apache projects give away a tremendous amount of technology every day under our permissive Apache license – always for free. All we ask is respect for our trademarks, and a little bit of credit for the many volunteer communities that build Apache software.

P.S. Apache projects love to get more code, documentation, testing, and other contributions too! And the ASF has a Sponsorship program.

But what we really want is what every human wants: just a little love. Just an extra Apache here and there makes us feel better.


What is Apache Hadoop?

There’s a lot of excitement around Hadoop software these days, so here’s my definition of what “Hadoop” means:

Hadoop™ is the ASF’s trademark for our Apache Hadoop software product that provides a service and simple programming model for the distributed processing of large data sets across clusters of commodity computers. Many people view Hadoop as the software that started the current “Big Data” processing model, which allows programmers to easily and effectively process huge data sets to get meaningful results.
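To give a rough feel for what a “simple programming model” can look like, here is a minimal, single-process Python sketch of the map/shuffle/reduce pattern commonly associated with Hadoop. It does not use any Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are my own illustrative assumptions; a real cluster would spread these steps across many machines and handle failures for you.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Apache Hadoop", "hadoop scales out"]))
```

The point of the sketch is only that the programmer writes two small functions (map and reduce) and leaves the grouping and distribution to the framework.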

The best place of all to learn about Hadoop is of course the Apache Hadoop project and community, which says this about the Hadoop software:

“(Hadoop) is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the (simple to program) application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

The Apache Hadoop project at the ASF is related to or has created a large number of notable modules, subprojects, or full projects at Apache, including:

There are a wide variety of vendors who provide Hadoop-related software; however, the only source for Hadoop software itself is the Apache Hadoop project here at the ASF. We certainly appreciate the many companies who allow their employees to contribute work to Apache Hadoop and all of our projects, as well as the many Apache Corporate Sponsors. However, I do hope that companies working in the Hadoop and related Big Data industry take stock of their marketing strategies, and ensure that their corporate marketing doesn’t shortchange the credit owed to the Apache Hadoop community itself.

We very much appreciate those corporate supporters who do provide plenty of credit to the ASF and the Apache Hadoop community – both the old hats, and the very new spinoff in the Big Data space. I just hope that some of the other players in the industry will carefully consider their public crediting (or lack thereof) to the ASF’s Hadoop brand and the many individual committers and contributors to the Apache Hadoop project.

As always, the Apache Hadoop website and mailing lists are the best place to learn about Hadoop software!

Oh, and remember:

Apache Hadoop, Hadoop, the yellow elephant logo, the names of Apache software products, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Congratulations to six new Apache projects!

In last week’s monthly meeting of the Board of Directors of the ASF, we approved the creation of six new Top Level Projects (TLPs) at the ASF. This is the most new TLPs ever created at once, followed only by the meeting of November 2008, where 5 new TLPs were created (CouchDB, Buildr, the Attic, Qpid, and Abdera).

In this particular case, much of the growth comes from within existing projects, wherein subproject communities within Hadoop and Lucene have matured sufficiently to deserve to manage their own fates, and to create their own Project Management Committees (PMCs) to take charge. To put this in another perspective, this is also reflective of the ASF’s growth; before this meeting we had over 70 TLPs and over 30 Incubator podlings, so an addition of 6 new TLPs is less than 10% growth for the month.

We should congratulate the Apache Traffic Server community first, since they went through the Incubation process and successfully graduated from an Incubator Podling into their own TLP. Soon to be served (once the website migration is complete) from http://trafficserver.apache.org/, Apache Traffic Server is a fast, scalable and extensible HTTP/1.1 compliant caching proxy server. Congratulations to the whole team for showing a strong and diverse community around this new product.

Next up come three subprojects within the well-known Apache Lucene project which have grown organically from modules within Lucene to be diverse and active projects within their own right. You may recognize some of these product names from the Lucene world.

  • Apache Mahout, which is building a system for creating scalable and effective machine learning libraries which can perform recommendation mining, clustering, classification, and grouping into itemsets.
  • Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Apache Nutch, integratable with both Lucene and Hadoop, adds web-specific crawling, fetching, and organization features.

The Apache Hadoop project – another wildly distributed computing technology – has also grown two of its subprojects to the point where they deserve their own fame.

  • Apache Avro is a fast data serialization system that includes rich and dynamic schemas in all its processing.
  • Apache HBase is the Hadoop database – designed to provide random, realtime read/write access to Big Data – billions of records – using commodity hardware.

Why did these subprojects spin out to become their own TLPs? The driving factor is not the technology, but rather the community and oversight aspects of how the ASF organizes its mostly self-running projects.

From the oversight perspective, the ASF Board relies on every project’s PMC to manage their project’s operations within the broad guidelines of the Apache Way, and to report their project’s progress and issues to the board. This means that there must be enough PMC members who can actively monitor and participate in their project’s activities, and can especially show due diligence and responsibility in voting on any official product releases the project makes. With the rapid growth in both community and technology areas in the Hadoop and Lucene projects, it’s a difficult job for the PMCs to truly understand and help manage all the subprojects they’ve created or added over the past two years.

While the scope of oversight may have hinted that some subprojects should be promoted to TLP status, the gating factor is community. Does a subproject have a strong and diverse enough community to provide their own, independent PMC that can manage their own affairs? Becoming a TLP is both a benefit and a responsibility: the community, through its new, more focused PMC, can better run itself; however, the new PMC is also expected to provide accurate reports and responsible oversight of their community and product releases.

Congratulations to all six new projects! Please note that as the websites are updated, each project will be moving its home page to http://projectname.apache.org in the near future.