Hadoop: 7 Elementary Questions Answered

May 22, 2015

Q: Where did the crazy Hadoop name come from?
A: It is the name of a toy elephant that belonged to the son of one of Hadoop’s founders, Doug Cutting. In 2002 Cutting was working at the Internet Archive (a non-profit digital library with a mission of universal access to all knowledge) when he teamed up with University of Washington grad student Mike Cafarella to build a better open-source search engine.

Q: What do MapReduce and the Google File System have to do with Hadoop?
A: The papers describing these two Google systems were published in 2003 (Google File System) and 2004 (MapReduce), and they provided the foundation for the development of Hadoop. MapReduce handles the ever greater amounts of data flowing through the internet by spreading the work across many commodity machines, and it copes with machine crashes or faults by rerunning the failed pieces of work on other machines. It takes a query over a dataset, divides it, and runs it in parallel across multiple cluster nodes. The Google File System was the model for the Hadoop Distributed File System (HDFS), which lets Hadoop store data on these same commodity machines with very high aggregate bandwidth across the cluster.
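To make the "divide it and run it in parallel" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop's API: the function names are made up for illustration, and threads on one machine stand in for cluster nodes. It counts words by mapping over chunks independently and then reducing the partial results into one answer.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Divide the dataset into roughly equal chunks, one per worker."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(lines):
    """Map step: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(records, num_workers=3):
    """Hypothetical driver: split, map in parallel, then reduce."""
    chunks = split_into_chunks(records, num_workers)
    # Assumption: threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(word_count(["big data", "big cluster", "data data"]))
```

Because each chunk is processed without reference to any other, a chunk whose worker fails could simply be handed to another worker and mapped again, which is the intuition behind MapReduce's fault tolerance.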

Q: What is a Hadoop cluster?
A: A computational cluster designed specifically for storing and analyzing huge amounts of unstructured data on low-cost commodity computers. These clusters are called “shared nothing” systems because the only thing shared between nodes is the network that connects them. They are highly scalable: if a cluster’s processing power is overwhelmed with data, additional nodes can be added to increase capacity. Because each piece of data is copied onto other cluster nodes, the data is not lost if one node fails.
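The way copies protect against a failed node can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names are hypothetical, and the round-robin placement is a simplification of the rack-aware placement real HDFS uses. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive positions modulo the node count are always distinct
        # as long as replication <= len(nodes).
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    blocks = ["b0", "b1", "b2", "b3", "b4"]
    placement = place_blocks(blocks, nodes)
    # One node failing leaves every block readable.
    print(sorted(readable_blocks(placement, ["n2"])))
```

In this model a block becomes unreadable only when every node holding one of its copies has failed, which is why a single node failure does not lose data.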

Q: What does Yahoo have to do with Hadoop?
A: In 2006 Yahoo hired Hadoop founder Doug Cutting to help it develop its own open-source technologies to compete with Google. Yahoo wanted to be a place where skilled data scientists and engineers would want to come to work. The 2006 version of Hadoop was nowhere near ready to handle production search workloads at the level of data coursing through the internet, but Yahoo was determined to scale Hadoop as far as it needed to go, and it continued investing heavily in it.

Q: How did the Data Science focus of Hadoop happen?
A: A critical step in Hadoop’s development was Yahoo’s decision to set up a research grid for its data scientists. They started with dozens of Hadoop nodes, which quickly grew to hundreds and then thousands as more data was added and the technology matured. “This very quickly kind of exploded and became our core mission,” Eric Baldeschwieler, Yahoo Search Engine Chief at the time, said, “because what happened is the data scientists not only got interesting research results – what we had anticipated – but they also prototyped new applications and demonstrated that those applications could substantially improve Yahoo’s search relevance or Yahoo’s advertising revenue.”

Q: What happened on June 10, 2009?
A: Yahoo made the source code for the version of Hadoop it runs in production available to the public. Yahoo contributes all the work it does to the open-source community. Its developers fix bugs, provide stability improvements and release the patch source code to the public. Why does Yahoo do this? It wants Hadoop to avoid the fragmentation that happened to the UNIX operating system. By making sure all developments and changes to Hadoop go back to the independent, non-profit Apache Software Foundation, Hadoop can overcome competitive struggles and be the de facto platform for storing large amounts of data.

Q: When did businesses beyond the Yahoo community begin using Hadoop?
A: Cloudera was the first commercial Hadoop company, and many CIOs and non-web engineers were first introduced to Hadoop by Cloudera. In 2011 Yahoo spun off Hortonworks as a separate company to build Hadoop software, sell Hadoop services and take over Yahoo’s role as the unofficial steward of the Hadoop project. Hortonworks worked closely with companies like Microsoft, Teradata and Rackspace to develop their internal Hadoop knowledge and to create products utilizing Hortonworks’ technology. While large web companies can manage the operational concerns of huge data processing, a corporate IT shop will want turnkey solutions. These turnkey solutions are only now coming to market. Baldeschwieler, who went on to work for Hortonworks, said, “…Ultimately, people don’t adopt Hadoop because it’s the best solution for processing small data. They adopt Hadoop because it’s demonstrated that it solves the biggest problems and they won’t outgrow it.”

Much of this information is from GigaOM’s excellent blog on The History of Hadoop. Interested in being a data scientist? Contact Executive Recruiter, Jacque Paige, Smith Hanley Associates, jpaige@smithhanley.com.
