Volunteering at the ASF and elsewhere in open source, I think a lot about open source brands. In particular: how do various open source projects – run by a wide variety of typically very geeky volunteers – present themselves publicly to new users? We often spend so much time working on the great new code, and explaining it to other developers we already know, that I wonder if we’re really showcasing what that code can do for new users and contributors.
Here’s my quick review of the Apache Spark project, told purely from the point of view of a new user who just came to the project website. I’m trying to show what I think someone new to the project might think about the project once they get to the homepage. Since Spark is a major project in the big data space, there are a lot of search hits for Spark, including a wide variety of other software vendors.
What Is Apache Spark?
Apache Spark is a high-level framework for executing transformations and actions on distributed datasets hosted across a variety of sources. It includes Python, Scala, Java, and other language APIs that let users write their big data analysis jobs easily. Spark includes built-in libraries for accessing SQL, Hadoop/HDFS, and other cloud/distributed datasets, as well as built-in libraries for machine learning, streaming, and graph processing.
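To make “transformations and actions” a bit more concrete, here is a minimal sketch in plain Python. To be clear, this is not the Spark API: ToyDataset, from_items, and the sample data are hypothetical names I made up for illustration, and everything runs on one machine. The only idea it tries to show is that transformations (map, filter) just describe work, while actions (collect, count) actually run it.

```python
from typing import Callable, Iterable, Iterator, List


class ToyDataset:
    """Toy, single-machine stand-in for a distributed dataset (NOT the Spark API)."""

    def __init__(self, source: Callable[[], Iterator]):
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    @classmethod
    def from_items(cls, items: Iterable) -> "ToyDataset":
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole pipeline and return a plain Python value.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls = []  # records every element that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = ToyDataset.from_items(range(1, 6))
even_squares = numbers.map(square).filter(lambda n: n % 2 == 0)

work_before_action = len(calls)  # nothing has been computed yet
result = even_squares.collect()  # the action triggers the pipeline
```

In this toy, square is not called at all until collect runs. As I understand it, that deferred style is the general idea behind chaining transformations before an action, though the real thing adds partitioning, scheduling across a cluster, and much more.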
For users not really familiar with how big data works, the homepage description is still pretty generic: “a fast and general engine for large-scale data processing”, although that kind of description goes with the big data territory.
No, Really, What Is Apache Spark For?
Many people are saying “Spark is the next Hadoop”. Spark provides the ability to easily and quickly run a wide variety of transformations or analyses on large datasets from common sources. Spark is often used atop Hadoop’s HDFS layer for storing or providing access to data, but Spark may provide simpler ways to actually program the analysis, MapReduce-style jobs, or other transformations you want to apply to your big data. Spark also has direct integration with streaming and machine learning libraries, which often makes working with Spark easier than integrating a bunch of different tools yourself atop a Hadoop cluster.
New User Website Perceptions
What might a new user see “above the fold” when first coming to the Apache Spark project homepage? For their first impression, is it easy to find things, and is the design appealing and easy to follow?
Personally, I think the homepage clearly and concisely describes Spark functionality, including some key points – speed, ease of use, broad applicability. Examples, FAQ, Documentation, and Download links are all prominently featured above the fold. Getting Started and Contributing are both featured below the fold.
UI design/style is simple but integrates well with the Spark logo, and includes a few charts, a simple-to-understand code example, and a News section with some recent updates. UI design is consistent across all major subpages, although the Documentation section uses a separate, more compact header design.
Apache Branding Requirements
Apache projects are expected to manage their own affairs, including making all technical and content decisions for their code and websites. However, to ensure a modicum of consistency – and to ensure users know that every Apache project is hosted at the ASF – there are a few requirements all Apache projects must include in their projectname.apache.org websites (or wikis, etc.). These are meant both to ensure that the ASF and the project can defend their trademarks, and to help show the broader community of developers and users that spans all Apache projects.
- Apache Spark is used in the homepage headline, but not used on many other pages (i.e. many only say “Spark” everywhere on the page).
- Website navigation links to ASF pages are not included.
- Footer includes trademark attributions and link.
- DOAP file registered at projects.apache.org
- Powered By Spark page on wiki is detailed and organized by users vs. third party supplemental/related software products, which is good to see.
There are some obvious sub-brands of products/modules from the PMC that are included within the Spark software product itself: Spark SQL, Spark Streaming, MLlib, GraphX.
SEO / Search Hits / Related Sites
SEO is far outside of our scope (and debatable in usefulness anyway), but it’s interesting to see: would a new user actually find the Apache Spark homepage when they were searching?
Searching for “spark software”:
Top hit: wikipedia
Second hit: homepage
Other hits: other Spark named software
Searching for “spark big data”:
Many vendor pages, tutorials, as well as wikipedia entry and our homepage.
We also find many software products named Spark from other organizations:
- Spark Framework: a micro Java web framework.
- Ignite Realtime Spark: an IM client.
- Cisco Spark: an app for team video messaging.
- Baidu Spark: a Chinese web browser fork of Chromium.
- Spark-2014: a programming language and set of techniques for mathematically verified, high-security software.
Other major uses of our Spark brand in domain names (i.e. these websites are specifically talking about Apache Spark software, but the domains are owned by other organizations):
https://spark-summit.org/ an annual conference.
http://spark-packages.org/ a “community index of packages for Apache Spark”
Apache Spark Social Media Presence
Many open source projects have a social media presence – although sometimes not as polished or consistent a presence as a commercial brand presence would have – which goes with the territory of having volunteers organize everything. Here are the top social media accounts I found in a quick search.
http://stackoverflow.com/questions/tagged/apache-spark tag is listed on the Spark website along with mailing lists as an unofficial place to get questions answered.
https://twitter.com/apachespark found but not linked, I presume official (i.e. the PMC seems to be running this account) but not very active.
https://www.linkedin.com/groups/7403611/profile found but not linked, I presume official.
https://plus.google.com/+TheApacheSpark found but not linked, and uses an old link to Incubator podling site!
What Do You Think Apache Spark Is?
So, what do you think? Is Spark really going to be the next Hadoop? Is the concept of a single software product providing your entire big data stack a good thing, or not really practical? How well do you think the Apache Spark project does at promoting itself – or any other Apache project, for that matter?
Note: I’m writing here as an individual, not wearing any Apache hat. So this is just some information that I hope might be useful to the Apache Spark community, not a requirement to change anything. I haven’t used Spark for any real deployments myself, so please do comment with corrections to anything I’ve messed up above!