These instructions cover installing and running Hadoop as a single-node cluster on OS X (Mac Pro).
Step 1: Creating a designated hadoop user on your system
This isn’t entirely necessary, but it’s a good idea for security reasons. To add a user, go to:
System Preferences > Accounts
Click the “+” button near the bottom of the account list. You may need to unlock this ability by hitting the lock icon at the bottom corner and entering the admin username and password.
When the new account window appears, enter a name, a short name, and a password. I entered the following:
Name: hadoop
Short name: hadoop
Password: MyPassword (well you get the idea)
Once you are done, hit “create account”. Now, log in as the hadoop user. You are ready to set up everything!
Step 2: Install/Configure Preliminary Software
Before installing Hadoop, there are a couple of things that you need to make sure you have on your system.
- Java, and the latest version of the JDK
- SSH
Because OS X is awesome, you actually don’t have to install these things. However, you will have to enable and update what you have. Let’s start with Java:
Updating Java
Open up the Terminal application. If it’s not already on your dock, you can access it through
Applications > Utilities > Terminal
Next check to see the version of Java that’s currently available on the system:
~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)
You may want to update this to the latest Java, which is available as an update for OS X. For example, you can download it here: http://support.apple.com/kb/DL1360.
After you download and install the update, you are going to need to configure Java on your system so the default points to this new update. Go to:
Applications > Utilities > Java > Java Preferences
Under “Java Version”, hit the radio button next to “Java SE 6”. Down by “Java Application Runtime Settings”, change the order so Java SE 6 (64 bit) is first, followed by Java SE 5 (64 bit) and so on. Hit “Save” and close this window.
Now, when you go to the terminal, and type in “java -version” you should get the following:
~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)
and for “javac -version”:
~$ javac -version
javac 1.6.0_24
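If you would rather script this check than eyeball it, here is a minimal sketch. The function name java_at_least_16 is made up for this post. It reads the version banner on standard input, so it can be tried without Java installed; the only thing it relies on is that java -version prints a quoted version string on its first line.

```shell
# Hypothetical helper (not part of Java or Hadoop): succeed if a
# `java -version` banner read from stdin reports version 1.6 or newer.
java_at_least_16() {
  # Pull the quoted version string (e.g. 1.6.0_24) out of the first line.
  ver=$(sed -n '1s/.*version "\([^"]*\)".*/\1/p')
  [ -n "$ver" ] || return 1
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  # Accept 1.6 and later, as well as newer schemes such as 17.0.1.
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]; }
}
```

Since java -version writes its banner to standard error, a call would look like: java -version 2>&1 | java_at_least_16 && echo "Java is recent enough"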
SSH: Setting up Remote Login and Enabling Self-Login
SSH also comes installed on your Mac. However, you need to enable access to your own machine (so hadoop doesn’t ask you for a password at inconvenient times). To do this, go to
System Preferences > Sharing (under Internet & Network)
Under the list of services, check “Remote Login”. For extra security, you can hit the radio button for “Only these users” and select hadoop.
Now, we’re going to configure things so we can log into localhost without being asked for a password. Type the following into the terminal:
$:~ ssh-keygen -t rsa -P ""
$:~ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Now try:
$:~ ssh localhost
You should be able to log in without a problem. You are now ready to install Hadoop. Let’s go to step 3!
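If ssh localhost still asks for a password, overly loose permissions on ~/.ssh are a common culprit. The sketch below is only a suggestion: tighten_ssh_perms is a name I made up, and it assumes your keys live in the default ~/.ssh directory.

```shell
# Hypothetical helper: apply the permissions sshd usually insists on
# before it will honor key-based logins.
tighten_ssh_perms() {
  dir="${1:-$HOME/.ssh}"
  # The directory should be accessible by its owner only.
  chmod 700 "$dir" || return 1
  # The key list should be readable and writable by its owner only.
  if [ -f "$dir/authorized_keys" ]; then
    chmod 600 "$dir/authorized_keys" || return 1
  fi
}
```

After running tighten_ssh_perms, you can test without any chance of a password prompt: ssh -o BatchMode=yes localhost true && echo "passwordless login works"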
Step 3: Downloading and Installing Hadoop
So this actually involves a few smaller steps:
- Downloading and Unpacking Hadoop
- Configuring Hadoop
After we finish these, you should be ready to go! So let’s get started:
Downloading and Unpacking Hadoop
Download Hadoop, making sure you get the latest version (as of this post, that is Hadoop 0.21). This tutorial refers to whichever version you download as hadoop-*.
Unpack the hadoop-*.tar.gz in the directory of your choice. I placed mine in ~/apps/hadoop-*. You may also want to set ownership permissions for the directory:
$:~ tar -xzvf hadoop-*.tar.gz
$:~ chown -R hadoop hadoop-*
Configuring Hadoop
Configuring Hadoop means editing conf/hadoop-env.sh and then a handful of XML configuration files. Start with conf/hadoop-env.sh. Open it in your favorite text editor and do the following:
- Uncomment the export JAVA_HOME line and set it to /Library/Java/Home
- Uncomment the export HADOOP_HEAPSIZE line and keep it at 2000
You may want to change other settings as well, but I chose to leave the rest of hadoop-env.sh the same. Here is an idea of what part of mine looks like:
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/Library/Java/Home
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
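Before moving on, it can be worth confirming that the JAVA_HOME you chose really contains a Java binary. This is a small sketch, with java_home_ok being a name of my own invention; it only checks for an executable bin/java under the given directory.

```shell
# Hypothetical helper: succeed if the directory looks like a usable
# JAVA_HOME, i.e. it contains an executable bin/java.
java_home_ok() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}
```

For example: java_home_ok /Library/Java/Home || echo "JAVA_HOME does not point at a JDK". On OS X releases that ship the /usr/libexec/java_home utility, its output is another candidate value to try if the check fails.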
Next, we need to set up several configuration files: core-site.xml, hdfs-site.xml, and mapred-site.xml. In earlier releases (before Hadoop 0.20), all these settings lived in a single file, hadoop-site.xml.
The most important settings here are hadoop.tmp.dir (which should be set to the directory of your choice) and the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties, which cap the number of map and reduce tasks that a task tracker can run simultaneously. You should also set dfs.replication to 1.
The following configuration files are to be created in the <hadoop_directory>/conf folder:
hdfs-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/jiakuanwang/apps/hadoop-0.21.0/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>
</configuration>
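To double-check what a setting ended up as without opening the files again, a tiny lookup function can help. This is only a sketch: conf_value is a hypothetical name, and it assumes each <name> and <value> element sits on a single line with the name first, as in the listings above. It is not a real XML parser.

```shell
# Hypothetical helper: print the value of one property from a Hadoop
# *-site.xml file. Usage: conf_value FILE PROPERTY_NAME
conf_value() {
  awk -v want="$2" '
    # Remember the most recent property name seen.
    /<name>/  { n = $0; sub(/.*<name>/, "", n);  sub(/<\/name>.*/, "", n) }
    # On a value line, print it if it belongs to the wanted property.
    /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                if (n == want) { print v; exit } }
  ' "$1"
}
```

For example, run from the Hadoop directory: conf_value conf/hdfs-site.xml dfs.replication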
Now to our next step!
Step 4: Formatting and Running Hadoop
This step involves formatting the namenode and testing our system. First of all, we need to set the Hadoop environment variables in the ~/.profile file:
# Set Hadoop environment variables
export HADOOP_HOME=/Users/jiakuanwang/apps/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
Use the following command to format the file system (the hdfs command is new in 0.21.0):
$ hdfs namenode -format
This will give you output along the lines of
$ hdfs namenode -format
11/05/02 11:13:07 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.21.0
STARTUP_MSG: classpath = …
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326; compiled by 'tomwhite' on Tue Aug 17 01:02:28 EDT 2010
************************************************************/
11/05/02 11:13:07 INFO namenode.FSNamesystem: defaultReplication = 1
11/05/02 11:13:07 INFO namenode.FSNamesystem: maxReplication = 512
11/05/02 11:13:07 INFO namenode.FSNamesystem: minReplication = 1
11/05/02 11:13:07 INFO namenode.FSNamesystem: maxReplicationStreams = 2
11/05/02 11:13:07 INFO namenode.FSNamesystem: shouldCheckForEnoughRacks = false
11/05/02 11:13:08 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/05/02 11:13:08 INFO namenode.FSNamesystem: fsOwner=jiakuanwang
11/05/02 11:13:08 INFO namenode.FSNamesystem: supergroup=supergroup
11/05/02 11:13:08 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/05/02 11:13:08 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/05/02 11:13:08 INFO common.Storage: Image file of size 117 saved in 0 seconds.
11/05/02 11:13:08 INFO common.Storage: Storage directory /Users/jiakuanwang/apps/hadoop-0.21.0/temp/dfs/name has been successfully formatted.
11/05/02 11:13:08 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
************************************************************/
Once this is done, we are ready to test our setup.
Running Hadoop with Example
Start HDFS and MapReduce
First, start up the DFS. This will start up a NameNode, DataNode, and SecondaryNameNode on the machine.
$ start-dfs.sh
starting namenode, logging to /Users/jiakuanwang/apps/hadoop/bin/../logs/hadoop-jiakuanwang-namenode-localhost.out
localhost: starting datanode, logging to /Users/jiakuanwang/apps/hadoop/bin/../logs/hadoop-jiakuanwang-datanode-localhost.out
localhost: starting secondarynamenode, logging to /Users/jiakuanwang/apps/hadoop/bin/../logs/hadoop-jiakuanwang-secondarynamenode-localhost.out
And then start MapReduce:
$ start-mapred.sh
starting jobtracker, logging to /Users/jiakuanwang/apps/hadoop/bin/../logs/hadoop-jiakuanwang-jobtracker-localhost.out
localhost: starting tasktracker, logging to /Users/jiakuanwang/apps/hadoop/bin/../logs/hadoop-jiakuanwang-tasktracker-localhost.out
We can use the jps command to check the running processes:
$ jps
17321 Jps
17239 TaskTracker
16996 DataNode
5362
16919 NameNode
17074 SecondaryNameNode
17161 JobTracker
And use the following command to check Hadoop status:
$ hdfs dfsadmin -report
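Scanning the jps listing by eye gets old quickly, so here is a rough sketch that reports which of the five daemons from this walkthrough are missing. The function name missing_daemons is my own, and the daemon list is simply the one shown in the jps output above.

```shell
# Hypothetical helper: read `jps` output on stdin and print the name of
# every expected daemon that is not listed.
missing_daemons() {
  out=$(cat)
  for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # Compare against the whole second field, so that
    # "SecondaryNameNode" does not count as "NameNode".
    printf '%s\n' "$out" |
      awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$d"
  done
}
```

Run it as jps | missing_daemons. No output means all five daemons are up.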
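Run WordCount Program
As input for our test, we are going to create two text files locally (the first command creates the input directory if it does not exist yet):
mkdir -p ~/workspace/hadoop-example/input
echo "Hello world Bye world" > ~/workspace/hadoop-example/input/file1
echo "hello hadoop bye hadoop" > ~/workspace/hadoop-example/input/file2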
Create a folder in DFS:
hadoop fs -mkdir /tmp/input
And then copy the files up to our DFS.
hadoop fs -put ~/workspace/hadoop-example/input/* /tmp/input
# List files in DFS
hadoop fs -ls /tmp/input
Now, it’s time to run the WordCount program:
$ hadoop jar ~/apps/hadoop/hadoop-mapred-examples-0.21.0.jar wordcount /tmp/input /output
11/05/02 13:51:26 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/05/02 13:51:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/05/02 13:51:26 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/05/02 13:51:26 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/05/02 13:51:27 INFO input.FileInputFormat: Total input paths to process : 2
11/05/02 13:51:27 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/05/02 13:51:27 INFO mapreduce.JobSubmitter: number of splits:2
11/05/02 13:51:27 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/05/02 13:51:27 WARN security.TokenCache: Overwriting existing token storage with # keys=0
11/05/02 13:51:27 INFO mapreduce.Job: Running job: job_local_0001
11/05/02 13:51:27 INFO mapred.LocalJobRunner: Waiting for map tasks
11/05/02 13:51:27 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
11/05/02 13:51:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
11/05/02 13:51:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
11/05/02 13:51:27 INFO mapred.MapTask: soft limit at 83886080
11/05/02 13:51:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
11/05/02 13:51:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
11/05/02 13:51:27 INFO mapred.LocalJobRunner:
11/05/02 13:51:27 INFO mapred.MapTask: Starting flush of map output
11/05/02 13:51:27 INFO mapred.MapTask: Spilling map output
11/05/02 13:51:27 INFO mapred.MapTask: bufstart = 0; bufend = 40; bufvoid = 104857600
11/05/02 13:51:27 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
11/05/02 13:51:27 INFO mapred.MapTask: Finished spill 0
11/05/02 13:51:27 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/05/02 13:51:27 INFO mapred.LocalJobRunner: map > sort
11/05/02 13:51:28 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
11/05/02 13:51:28 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000000_0
11/05/02 13:51:28 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000001_0
11/05/02 13:51:28 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
11/05/02 13:51:28 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
11/05/02 13:51:28 INFO mapred.MapTask: soft limit at 83886080
11/05/02 13:51:28 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
11/05/02 13:51:28 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
11/05/02 13:51:28 INFO mapred.LocalJobRunner:
11/05/02 13:51:28 INFO mapred.MapTask: Starting flush of map output
11/05/02 13:51:28 INFO mapred.MapTask: Spilling map output
11/05/02 13:51:28 INFO mapred.MapTask: bufstart = 0; bufend = 38; bufvoid = 104857600
11/05/02 13:51:28 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
11/05/02 13:51:28 INFO mapred.MapTask: Finished spill 0
11/05/02 13:51:28 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
11/05/02 13:51:28 INFO mapred.LocalJobRunner: map > sort
11/05/02 13:51:28 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
11/05/02 13:51:28 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000001_0
11/05/02 13:51:28 INFO mapred.LocalJobRunner: Map task executor complete.
11/05/02 13:51:28 INFO mapred.Merger: Merging 2 sorted segments
11/05/02 13:51:28 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 61 bytes
11/05/02 13:51:28 INFO mapred.LocalJobRunner:
11/05/02 13:51:28 WARN conf.Configuration: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
11/05/02 13:51:28 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/05/02 13:51:28 INFO mapred.LocalJobRunner:
11/05/02 13:51:28 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/05/02 13:51:28 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /output
11/05/02 13:51:28 INFO mapred.LocalJobRunner: reduce > sort
11/05/02 13:51:28 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
11/05/02 13:51:28 INFO mapreduce.Job: map 100% reduce 100%
11/05/02 13:51:28 INFO mapreduce.Job: Job complete: job_local_0001
11/05/02 13:51:28 INFO mapreduce.Job: Counters: 20
FileInputFormatCounters
BYTES_READ=46
FileSystemCounters
FILE_BYTES_READ=757532
FILE_BYTES_WRITTEN=1036089
HDFS_BYTES_READ=116
HDFS_BYTES_WRITTEN=45
Map-Reduce Framework
Combine input records=8
Combine output records=6
Failed Shuffles=0
GC time elapsed (ms)=19
Map input records=2
Map output bytes=78
Map output records=8
Merged Map outputs=0
Reduce input groups=6
Reduce input records=6
Reduce output records=6
Reduce shuffle bytes=0
Shuffled Maps =0
Spilled Records=12
SPLIT_RAW_BYTES=204
After the job is finished, we can check the output folder:
$ hadoop fs -ls /output
11/05/02 13:56:48 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/05/02 13:56:48 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 2 items
-rw-r--r-- 1 jiakuanwang supergroup 0 2011-05-02 13:51 /output/_SUCCESS
-rw-r--r-- 1 jiakuanwang supergroup 45 2011-05-02 13:51 /output/part-r-00000
And check the result file:
$ hadoop fs -cat /output/part-r-00000
11/05/02 13:58:01 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/05/02 13:58:01 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Bye 1
Hello 1
bye 1
hadoop 2
hello 1
world 2
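As a sanity check, the same counts can be reproduced locally with standard Unix tools, no Hadoop involved. This is just a sketch for comparison: local_wordcount is a made-up name, it splits on spaces and tabs only, and it prints each word and its count separated by a tab.

```shell
# Hypothetical helper: count whitespace-separated words in the given
# files and print "word<TAB>count" lines sorted in byte order.
local_wordcount() {
  cat "$@" |
    tr -s ' \t' '\n\n' |   # one word per line
    grep -v '^$' |         # drop empty lines
    LC_ALL=C sort |        # byte order: uppercase sorts before lowercase
    uniq -c |              # count identical adjacent words
    awk '{ print $2 "\t" $1 }'
}
```

Running local_wordcount ~/workspace/hadoop-example/input/file1 ~/workspace/hadoop-example/input/file2 should list the same six words and counts as the job output above.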
That’s all for this quick start on developing and running Hadoop jobs on a Mac. You can go further from here. Good luck!