Packages | |
---|---|
org.apache.hadoop.filecache | Deprecated. Use Job instead. |
org.apache.hadoop.mapred | A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) built of commodity hardware in a reliable, fault-tolerant manner. |
org.apache.hadoop.mapred.jobcontrol | Utilities for managing dependent jobs. |
org.apache.hadoop.mapred.join | Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. |
org.apache.hadoop.mapred.pipes | Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. |
org.apache.hadoop.mapred.tools | |
org.apache.hadoop.mapreduce | |
org.apache.hadoop.mapreduce.security | |
org.apache.hadoop.mapreduce.server.jobtracker | |
org.apache.hadoop.mapreduce.tools | |
Libraries | |
---|---|
org.apache.hadoop.mapred.lib | Library of generally useful mappers, reducers, and partitioners. |
org.apache.hadoop.mapred.lib.aggregate | Classes for performing various counting and aggregations. |
org.apache.hadoop.mapred.lib.db | org.apache.hadoop.mapred.lib.db Package |
org.apache.hadoop.mapreduce.lib.aggregate | Classes for performing various counting and aggregations. |
org.apache.hadoop.mapreduce.lib.chain | |
org.apache.hadoop.mapreduce.lib.db | org.apache.hadoop.mapreduce.lib.db Package |
org.apache.hadoop.mapreduce.lib.fieldsel | |
org.apache.hadoop.mapreduce.lib.input | |
org.apache.hadoop.mapreduce.lib.jobcontrol | Utilities for managing dependent jobs. |
org.apache.hadoop.mapreduce.lib.join | Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. |
org.apache.hadoop.mapreduce.lib.map | |
org.apache.hadoop.mapreduce.lib.output | |
org.apache.hadoop.mapreduce.lib.partition | |
org.apache.hadoop.mapreduce.lib.reduce | |
Examples | |
---|---|
org.apache.hadoop.examples | Hadoop example code. |
org.apache.hadoop.examples.dancing | This package is a distributed implementation of Knuth's dancing links algorithm that can run under Hadoop. |
org.apache.hadoop.examples.pi | This package consists of a map/reduce application, distbbp, which computes exact binary digits of the mathematical constant π. |
org.apache.hadoop.examples.pi.math | This package provides useful mathematical library classes for the distbbp program. |
org.apache.hadoop.examples.terasort | This package consists of 3 map/reduce applications for Hadoop to compete in the annual terabyte sort competition. |
contrib: DataJoin | |
---|---|
org.apache.hadoop.contrib.utils.join | |
contrib: Gridmix | |
---|---|
org.apache.hadoop.mapred.gridmix | |
contrib: Index | |
---|---|
org.apache.hadoop.contrib.index.example | |
org.apache.hadoop.contrib.index.lucene | |
org.apache.hadoop.contrib.index.main | |
org.apache.hadoop.contrib.index.mapred | |
contrib: Streaming | |
---|---|
org.apache.hadoop.streaming | Hadoop Streaming is a utility which allows users to create and run Map-Reduce jobs with any executables. |
org.apache.hadoop.streaming.io | |
org.apache.hadoop.typedbytes | Typed bytes are sequences of bytes in which the first byte is a type code. |
Hadoop is a distributed computing platform.
Hadoop primarily consists of the Hadoop Distributed FileSystem (HDFS) and an implementation of the Map-Reduce programming paradigm.
Hadoop is a software framework that lets one easily write and run applications that process vast amounts of data. Here's what makes Hadoop especially useful:
If your platform does not have the required software listed above, you will have to install it.
For example, on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the packages:
First, you need to get a copy of the Hadoop code.
Edit the file conf/hadoop-env.sh to define at least JAVA_HOME.
Try the following command:
bin/hadoop
This will display the documentation for the Hadoop command script.
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
mkdir input
This will display counts for each match of the regular expression.
Note that input is specified as a directory containing input files and that output is also specified as a directory where parts are written.
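To make the result of the grep example concrete, the following plain-Java sketch counts every match of the regular expression dfs[a-z.]+ in a few lines of text. It does not use the Hadoop API and is not the source of the bundled example. The class name GrepSketch, the method countMatches, and the sample lines are hypothetical, and the output is ordered by matched string, which may differ from the ordering the real job produces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is not the Hadoop example's source code.
final class GrepSketch {

    // Counts every substring of the given lines that matches the regex.
    static Map<String, Integer> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                // Add 1 to the running count for this exact matched string.
                counts.merge(matcher.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for the files under input/.
        List<String> lines = Arrays.asList(
            "<name>dfs.replication</name>",
            "<name>dfs.replication</name>",
            "<name>dfs.name.dir</name>");
        // Print one "count<TAB>match" line per distinct match.
        countMatches(lines, "dfs[a-z.]+")
            .forEach((match, count) -> System.out.println(count + "\t" + match));
    }
}
```

In the Hadoop job the same counting is spread across map and reduce tasks, and the results are written as part files in the output directory rather than printed.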
JobTracker (MapReduce master) host and port. This is specified with the configuration property mapreduce.jobtracker.address.
(We also set the HDFS replication level to 1 in order to reduce warnings when running on a single node.)
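As an illustration of the two settings just mentioned, the sketch below renders them in Hadoop's XML configuration layout of name/value property elements. It is only a sketch under stated assumptions: the class SiteConfigSketch is hypothetical, the address localhost:9001 is an assumed placeholder, and dfs.replication is assumed to be the property that controls the HDFS replication level. The configuration file that should hold these entries is not named in this text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that prints settings as Hadoop-style <property> elements.
final class SiteConfigSketch {

    // Renders each map entry as one <property> with <name> and <value> children.
    // Assumes names and values contain no XML special characters (no escaping is done).
    static String render(Map<String, String> properties) {
        StringBuilder xml = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(entry.getKey()).append("</name>\n")
               .append("    <value>").append(entry.getValue()).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> properties = new LinkedHashMap<>();
        // Assumed placeholder host and port for a single-node setup.
        properties.put("mapreduce.jobtracker.address", "localhost:9001");
        // Replication level 1, as described above; the property name is an assumption.
        properties.put("dfs.replication", "1");
        System.out.print(render(properties));
    }
}
```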
Now check that the command ssh localhost does not require a password. If it does, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
A new distributed filesystem must be formatted with the following command, run on the master node:
bin/hadoop namenode -format
The Hadoop daemons are started with the following command:
bin/start-all.sh
Daemon log output is written to the logs/ directory.
Input files are copied into the distributed filesystem as follows:
bin/hadoop fs -put input input
Things are run as before, but output must be copied locally to examine it:
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
When you're done, stop the daemons with:
bin/stop-all.sh
Fully distributed operation is just like the pseudo-distributed operation described above, except that you must specify:
Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
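Because the slaves file is expected to hold exactly one hostname or IP address per line, a quick check before starting the cluster can catch formatting mistakes. The sketch below is one possible check, not part of Hadoop: the class SlavesFileSketch and the method parseHosts are hypothetical, and skipping blank lines and rejecting lines with embedded whitespace are assumptions about what a well-formed file looks like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical validator for the "one host per line" layout described above.
final class SlavesFileSketch {

    // Returns the trimmed, non-blank entries; rejects lines holding several entries.
    static List<String> parseHosts(List<String> lines) {
        List<String> hosts = new ArrayList<>();
        for (String line : lines) {
            String host = line.trim();
            if (host.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            if (host.contains(" ") || host.contains("\t")) {
                throw new IllegalArgumentException(
                    "expected one host per line, got: " + host);
            }
            hosts.add(host);
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Hypothetical file contents; real code could read conf/slaves instead.
        List<String> lines = Arrays.asList("node1.example.com", "", "10.0.0.7");
        for (String host : parseHosts(lines)) {
            System.out.println(host);
        }
    }
}
```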