What is Apache Mahout? Website Brand Review

Website Brand Review of Apache Mahout

While we’ve all heard about Apache Hadoop, did you know there are over a dozen big data projects at Apache? We host projects that provide everything for your big data stack: databases, storage, streaming, logging, analysis, machine learning, and more. Apache Mahout is one of the pieces that puts a big data stack to work doing higher-level tasks for you.

Here’s my quick review of the Apache Mahout project, told purely from the point of view of a new user finding the project website.

Happy Birthday! This month is the Apache Mahout project’s 6th #ApacheBirthday!

What Is Apache Mahout?

“The Apache Mahout™ project’s goal is to build an environment for quickly creating scalable performant machine learning applications.”

While this is a laudable statement – and nicely emphasises the community behind the project – it doesn’t directly say what the software they provide does.

“The three major components of Mahout are an environment for building scalable algorithms, many new Scala + Spark and H2O (Apache Flink in progress) algorithms, and Mahout’s mature Hadoop MapReduce algorithms.”

Congratulations to six new Apache projects!

In last week’s monthly meeting of the Board of Directors of the ASF, we approved the creation of six new Top Level Projects (TLPs) at the ASF. This is the most new TLPs ever created at once; the next closest was the meeting of November 2008, where 5 new TLPs were created (CouchDB, Buildr, the Attic, Qpid, and Abdera).

In this particular case, much of the growth comes from within existing projects: subproject communities within Hadoop and Lucene have matured sufficiently to deserve to manage their own fates, and to create their own Project Management Committees (PMCs) to take charge. To put this in another perspective, this is also reflective of the ASF’s growth; before this meeting we had over 70 TLPs and over 30 Incubator podlings, so an addition of 6 new TLPs is less than 10% growth for the month.

We should congratulate the Apache Traffic Server community first, since they went through the Incubation process and successfully graduated from an Incubator Podling into their own TLP. Soon to be served (once the website migration is complete) from http://trafficserver.apache.org/, Apache Traffic Server is a fast, scalable and extensible HTTP/1.1 compliant caching proxy server. Congratulations to the whole team for showing a strong and diverse community around this new product.

Next up come three subprojects within the well-known Apache Lucene project which have grown organically from modules within Lucene to be diverse and active projects in their own right. You may recognize some of these product names from the Lucene world.

  • Apache Mahout, which is building a system for creating scalable and effective machine learning libraries which can perform recommendation mining, clustering, classification, and grouping into itemsets.
  • Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Apache Nutch, integratable with both Lucene and Hadoop, adds web-specific crawling, fetching, and organization features.

The Apache Hadoop project – another wildly distributed computing technology – has also grown two of its subprojects to the point where they deserve their own fame.

  • Apache Avro is a fast data serialization system that includes rich and dynamic schemas in all its processing.
  • Apache HBase is the Hadoop database – designed to provide random, realtime read/write access to Big Data – billions of records – using commodity hardware.

Why did these subprojects spin out to become their own TLPs? The driving factor is not the technology, but rather the community and oversight aspects of how the ASF organizes its mostly self-running projects.

From the oversight perspective, the ASF Board relies on every project’s PMC to manage their project’s operations within the broad guidelines of the Apache Way, and to report their project’s progress and issues to the board. This means that there must be enough PMC members who can actively monitor and participate in their project’s activities, and can especially show due diligence and responsibility in voting on any official product releases the project makes. With the rapid growth in both community and technology areas in the Hadoop and Lucene projects, it’s a difficult job for the PMCs to truly understand and help manage all the subprojects they’ve created or added over the past two years.

While the scope of oversight may have hinted that some subprojects should be promoted to TLP status, the gating factor is community. Does a subproject have a strong and diverse enough community to provide its own, independent PMC that can manage its own affairs? Becoming a TLP is both a benefit and a responsibility: the community, through its new, more focused PMC, can better run itself; however, the new PMC is also expected to provide accurate reports and responsible oversight of its community and product releases.

Congratulations to all six new projects! Please note that as the websites are updated, each project will be moving its home page to http://projectname.apache.org in the near future.