Website Brand Review of Apache Hive
While we’ve all heard about Apache Hadoop®, did you know there are over a dozen big data projects at Apache? We host projects that provide all the different functions of your big data stack: databases, storage, streaming, logging, analysis, and more. Apache Hive™ is one piece of this big data ecosystem.
Here’s my quick review of the Apache Hive project, told purely from the point of view of a new user finding the project website.
What Is Apache Hive?
“The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage”.
“Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.”
No, Really, What Is Apache Hive For?
Hive is built on top of Hadoop, so it provides simple access to Hadoop functionality while making it easier to query or extract your data, and it offers simpler ways to program analysis and MapReduce jobs. In particular, Hive provides HiveQL, a query language similar to SQL, that lets you perform your data operations, MapReduce, and analysis all together. HiveQL can be extended with a variety of analysis functions, and can work on a variety of underlying data formats and backends, including custom ones.
Hive specializes in scalability, extensibility, and ease of programming access to both data and analysis. Use Hive for complex tasks you can run as a batch job, where you need fault tolerance and the ability to use various analysis functions and data formats at the same time.
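To make the “plug in their custom mappers and reducers” idea a little more concrete, here is a minimal sketch of what such a mapper might look like in Python. It assumes Hive's TRANSFORM clause, which streams rows to an external script as tab-separated lines on standard input and reads tab-separated rows back from standard output. The script name `word_mapper.py`, the table `docs`, and the column names are all hypothetical, made up purely for illustration, and this is a sketch under those assumptions rather than anything taken from the Hive documentation.

```python
import sys

# Hypothetical HiveQL that would call this script (names are assumptions):
#   ADD FILE word_mapper.py;
#   SELECT TRANSFORM (doc_id, body)
#   USING 'python word_mapper.py'
#   AS (word, cnt)
#   FROM docs;


def map_line(line):
    """Turn one tab-separated input row into a list of (word, 1) pairs."""
    fields = line.rstrip("\n").split("\t")
    text = fields[-1]  # assumption: the last column holds the free text
    return [(word.lower(), 1) for word in text.split()]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read rows from standard input and emit one tab-separated row per word.
    for line in stdin:
        for word, count in map_line(line):
            stdout.write(f"{word}\t{count}\n")


if __name__ == "__main__":
    main()
```

The appeal of this arrangement is that the query side stays in a SQL-like language while only the awkward per-row logic lives in the script; the counts emitted here would still need to be summed by a later GROUP BY or a reducer script.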
New User Website Perceptions
That is, what does a new user see “above the fold” when first coming to the Apache Hive project homepage? For their first impression, is it easy to find things, and is the design appealing and easy to follow?
The homepage features a distinctive yellow-striped stinging elephant logo, but otherwise offers a fairly sparse overview that barely fills the browser page. The navbar provides a categorized list of all the expected pages for Documentation, Community, and Development, with well-named links for every major topic about contributing to the project. While there is a prominent link to Getting Started, the homepage itself really doesn’t explain much more than that.
The navbar also links to the PMC Bylaws, Editing the Website, How To Release, and a great Becoming a Committer guide. These are important topics for more advanced contributors, and they are often harder to find on other project sites. Most Hive content is on the project wiki, which includes the logo and some basic navbars, but doesn’t have an obvious direct way to navigate back and forth between different topic areas. The wiki content tends to be structured very hierarchically, so there are good outlines/tables of contents and plenty of details, but not as much background information.
Apache Branding Requirements
Apache projects are expected to manage their own affairs, including making all technical and content decisions for their code and websites. However, to ensure a modicum of consistency – and to ensure users know that an Apache project is hosted at the ASF – there are a few requirements all Apache projects must meet on their projectname.apache.org websites (or wikis, etc.).
- Apache Hive is used fairly consistently, and carries the ™ attribution appropriately on top-level pages.
- Links to the required ASF pages are included in the site’s navigation system (except **not** Security!).
- Logo does not include TM; footers include complete trademark attributions, including noting other trademark owners.
- DOAP file exists and includes appropriate description and links.
SEO / Search Hits / Related Sites
Well, SEO is far outside of our scope (and debatable in usefulness anyway), but it’s interesting to see: how does a new user find the Apache Hive homepage when they were searching?
Searching for “Hive” (a common word, so we might expect a lot of other hits):
Top hits vary: either unrelated results, the project homepage, or the Wikipedia page.
Searching for “Hive software”:
Top hits are typically either the project homepage or the Wikipedia page. Several links to other software products named Hive show up as well; most are unrelated. There is a link to Hive on SourceForge, “a Java software platform for creating distributed applications”, not updated for a long time but listed with a GPL license.
Social Media Presence
The Hive project has some social media presence, but it’s not featured obviously anywhere on the project website.
- https://twitter.com/ApacheHive is the official feed, but not updated recently.
- https://www.facebook.com/apache.hive/ is the official Facebook Page and linked from the homepage, with 3000 likes, but not updated for a long time.
- http://stackoverflow.com/questions/tagged/hive is active and includes an About and link to the project homepage.
There aren’t any obvious LinkedIn or Google+ communities specifically about Hive, although with so many Hadoop-related communities, that’s not surprising.
What Do You Think Apache Hive Is?
So, what do you think? Is Hive still a major player in providing an easier API for your big data batch jobs, or have you moved on? If you’re not using Hive, are you instead using something newer for batch jobs, or are you focusing on streaming data now that Apache Spark and a wide variety of other tools can provide some of the same features, but faster?
Note: I’m writing here as an individual, not wearing any Apache hat. I hope this is useful both to new users and to the Apache Hive community, not necessarily a call to change anything. I haven’t used Hive for any real deployments myself, so please do comment with corrections to anything I’ve messed up above!