This is a summary of the Hadoop Distributed File System white paper.
Abstract
- The Hadoop Distributed File System (HDFS) is designed to store very large datasets reliably and to stream them at high bandwidth to user applications.
- The paper describes the architecture of HDFS and reports on the experience of using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Introduction
- Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm (a taste of the file system API is sketched after this list).
- Hadoop is an Apache project; all components are available under the Apache open source license.