Hadoop revolutionized Big Data with its ecosystem of open source components, but the debate remains: have this revolutionary tool’s best days already passed? With a number of Hadoop-related products surfacing, such as Spark, Hive and Storm, an aspiring data scientist can be left with more questions than answers about where to start. To assess the relevance and staying power of Hadoop, it is important to understand what it is and the purpose behind its development.

What is Hadoop?

Before diving into Hadoop’s beginnings in 2005, let’s address the real burning question: where did that cute little yellow elephant come from? Doug Cutting, Hadoop’s co-creator and now Chief Architect at Cloudera, named the tool after his two-year-old’s toy elephant.1 With everyone’s curiosity now at ease, on to Hadoop’s beginnings.

Hadoop was released by the Apache Software Foundation, a non-profit organization that produces open source software powering much of the Internet behind the scenes. The development of Hadoop began when forward-thinking software engineers realized how useful it would be to store and analyze data sets far larger than can practically be stored and accessed on one physical storage device (such as a hard disk).2

Benefits of Hadoop

The biggest testament to Hadoop’s staying power is its batch processing. Both Spark and Storm boast real-time processing power and notable efficiencies around the ETL (Extract, Transform, Load) step in populating data warehouses; however, the most important point regarding Hadoop’s staying power is that these same programs are best used within the Hadoop ecosystem.

Matei Zaharia, CTO at Databricks, explains that some data scientists use “Hadoop” to mean the whole ecosystem (HDFS, Hive, MapReduce, etc.), while others mean MapReduce in particular.3 As you can probably imagine, this nuance can take the debate in two entirely different directions.
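For readers new to the MapReduce side of that distinction, the idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps running in a single process on an in-memory list; it does not use any Hadoop API, and the function names and sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a batch of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample batch; a real job would read files from distributed storage.
    sample = ["big data big clusters", "data in batches"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on the portion of the data stored nearby; the sketch only shows the shape of the computation.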

In conclusion, if we are referring to Hadoop’s entire ecosystem, then there is no question in the data science community that it is here to stay and shows no signs of slowing its adaptation to new market demands. Community is the operative word here. As long as the Apache Software Foundation remains committed to “providing support for the Apache Community of open-source software projects, which provide software products for the public good”4, we will continue to see an ever-evolving, robust Hadoop ecosystem. Any research into the topic will point back to a general piece of advice from prominent figures in the data science community: learn Hadoop, then build your knowledge of other programs within its ecosystem (such as Spark and Storm) from there.

If you are inquisitive, have a strong background in math and statistics, and are interested in learning more about software such as Hadoop, you might want to consider pursuing a degree in data science.