What is Apache Mahout? Website Brand Review

Website Brand Review of Apache Mahout

While we’ve all heard about Apache Hadoop, did you know there are over a dozen big data projects at Apache? We host projects that provide everything for your big data stack: databases, storage, streaming, logging, analysis, machine learning, and more. Apache Mahout is one of the pieces that puts a big data stack to work doing higher-level tasks for you.

Here’s my quick review of the Apache Mahout project, told purely from the point of view of a new user finding the project website.

Happy Birthday! This month is the Apache Mahout project’s 6th #ApacheBirthday!

What Is Apache Mahout?

“The Apache Mahout™ project’s goal is to build an environment for quickly creating scalable performant machine learning applications.”

While this is a laudable statement – and nicely emphasises the community behind the project – it doesn’t directly say what the software they provide does.

“The three major components of Mahout are an environment for building scalable algorithms, many new Scala + Spark and H2O (Apache Flink in progress) algorithms, and Mahout’s mature Hadoop MapReduce algorithms.”

No, Really, What Is Apache Mahout For?

Mahout offers a specific set of fairly detailed functionality for training machine learning models on large data corpora; something that’s hard to describe unless you are familiar with large-scale machine learning. Mahout is essentially an environment or framework that provides building blocks for defining and executing machine learning algorithms that can process data at scale with very simple programming. There are two important sides to the Mahout framework:

  • Engine / data source features: you can run Mahout analyses easily on a single data set, MapReduce jobs, Spark, H2O, or Flink data sources. Mahout provides pre-made connections that handle both data source input/output and job distribution for you automatically, using normal Hadoop and MapReduce techniques.
  • Machine learning algorithms: Mahout comes with a solid range of implemented algorithms that you can simply run and train on your data. In particular, the newly announced Samsara environment provides a richer way to express math concepts along with your algorithms, as well as providing performance gains when running your jobs on various data sources and types. Mahout provides premade Classification, Clustering, Recommendation, Statistical, and other algorithms you can just start using.
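For readers who (like me) are not machine learning developers, here is a minimal sketch of the kind of idea a recommendation algorithm builds on: counting which items tend to show up together in users' histories, then suggesting the items that most often co-occur with what a user already has. This is plain Python, not Mahout's API; the function names (`cooccurrence`, `recommend`) and the toy data are hypothetical and made up purely for illustration. A framework like Mahout is aimed at doing this sort of work at a scale where a single in-memory loop would not be practical.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(histories):
    """Count how often each ordered pair of distinct items shares a user history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        # set() so a repeated item in one history is only counted once
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, histories, counts, top_n=3):
    """Rank unseen items by their total co-occurrence with the user's items."""
    seen = set(histories[user])
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically so the order is stable
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Invented toy data: which projects each (hypothetical) user has tried
histories = {
    "alice": ["hadoop", "spark", "mahout"],
    "bob": ["hadoop", "spark"],
    "carol": ["spark", "flink"],
    "dave": ["hadoop", "mahout"],
}

counts = cooccurrence(histories)
print(recommend("bob", histories, counts))  # prints ['mahout', 'flink']
```

In this toy data, "mahout" ranks first for bob because it co-occurs with both of the items he already has, while "flink" only co-occurs with one of them.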

New User Website Perceptions

What does a new user see “above the fold” when first coming to the Apache Mahout project homepage? For their first impression, is it easy to find things, and is the design appealing and easy to follow?

The homepage features a clean layout with a search box, project navbar, and a sidebar with Apache-wide links, including a useful “Related Projects” listing. There are prominent links to a variety of technical information, as well as a navbar Developers listing that points to the usual how-to topics (get source, build, contribute, release, etc.). There is no obvious “Get Started” link or section, although many of the technical topics feature their own FAQ or intro section that includes specific examples for using those features.

The attractive Mahout logo provides the obvious association to the Hadoop elephant, and a FAQ explains the connection. The website has consistent navigation and styling. While much of the drill-down content (algorithm definitions and examples) is highly technical, some overview or introductory pages have a welcoming style. How to Contribute and similar pages also provide friendly advice or more information about the hows and whys for newcomers, which is great to see on such a highly technical project. There is also a Reference Reading page and a listing of books available, both of which provide pointers (if a little hidden) that help people understand the math and theory behind machine learning algorithms. Similarly, the Powered By and “professional support” pages are well laid out.

The Developer resources landing page provides a link to board reports, which is highly useful to help committers understand governance around Apache projects; however, the listing stops in 2015. While the overall content organization of the site isn’t obvious to me (not being a machine learning developer!), there is a lot of information on both theory and specific examples spread across the website, so it should be of value for users.

Apache Branding Requirements

Apache projects are expected to manage their own affairs, including making all technical and content decisions for their code and websites. However, to ensure a modicum of consistency – and to ensure users know that an Apache project is hosted at the ASF – there are a few requirements all Apache projects must meet in their projectname.apache.org websites (or wikis, etc.).

  • Apache Mahout is used on the homepage and announcements, but is usually shortened in other places.
  • Links to ASF pages are included in the site’s navigation system.
  • Logo does not include TM; footers include only an Apache trademark attribution.
  • DOAP file exists, but only with basic info and an old release.

SEO / Search Hits / Related Sites

Well, SEO is far outside of our scope (and debatable in usefulness anyway), but it’s interesting to see: how does a new user find the Apache Mahout homepage when they were searching?

Searching for “Mahout”:
Top hits: varies; either project homepage/Wikipedia, or definitions for elephant rider.

Searching for “Mahout software”:
Top hits are the project pages, and then some other articles or corporate pages about Apache Mahout, either technology or machine learning impact.

Social Media Presence

The Mahout project has a prominent Twitter feed.

What Do You Think Apache Mahout Is?

So, what do you think? Is Mahout the go-to platform for machine learning? If so, is it because of the data and scale/MapReduce integration, or because of the pre-implemented algorithm sets? Is Mahout flexible enough for you to build your recommender engine on top of it directly, or does your work truly need a more integrated framework to get started?

Note: I’m writing here as an individual, not wearing any Apache hat. I hope this is useful both to new users and to the Apache Mahout community, not necessarily a call to change anything. I haven’t used Mahout for any real deployments myself, so please do comment with corrections to anything I’ve mixed up above!
