Recent activity
Ryan replied on December 15, 2009 20:56 to the idea "How about an API and Email-To-Text Support?" in GOGII:
-
Ryan started following the question "How to run only one Reduce Task and it begins its execution only when all Map Tasks have been completed?" in Cloudera.
Ryan replied on November 10, 2009 03:02 to the question "How to include third-party libraries in a MapReduce job?" in Cloudera:
I just need to ship with my job some python libraries.
In the end I just made a tar/jar of all the files I needed, like a snapshot of my working directory, and wrapped my original mapper with a bash script that first untars/unjars the archive and then executes my python mapper.
My main problem was not realizing that hadoop does not automatically untar or unjar.
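For anyone trying the same thing, here is a minimal sketch of the wrapper idea described above, written in Python rather than the bash script actually used. The names `unpack_and_run`, `deps.tar` and `mapper.py` are hypothetical; the only assumption is that the archive has been shipped to the task's working directory alongside the wrapper.

```python
import subprocess
import sys
import tarfile


def unpack_and_run(archive_path, mapper_cmd, dest="."):
    """Extract archive_path into dest, run mapper_cmd there, return its exit code."""
    # Unpack the shipped dependencies first, since nothing does this for us.
    with tarfile.open(archive_path) as archive:
        archive.extractall(dest)
    # The child process inherits stdin/stdout, so streaming records
    # pass straight through the wrapper to the real mapper.
    return subprocess.call(mapper_cmd, cwd=dest)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: wrapper.py deps.tar python mapper.py
    sys.exit(unpack_and_run(sys.argv[1], sys.argv[2:]))
```

The wrapper would then be given to the job as the mapper command, with the archive shipped next to it.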
Ryan asked a question in Cloudera on November 10, 2009 02:58:
Processing a single file per map using streaming, setting jobconf
Here is Hadoop's own answer to the question 'How do I process files, one per map?':
http://hadoop.apache.org/common/docs/...+do+I+process+files%2C+one+per+map%3F
How do I specify those options in a streaming job? My best guess is to specify them via the jobconf option. If that is the case, I'm not sure what parameters to set.
My goal is to process many small files, possibly from a sequence file, one per map.
Thanks!
Ryan replied on November 04, 2009 18:22 to the question "How to include third-party libraries in a MapReduce job?" in Cloudera:
Todd,
Thanks for the reply. However, it didn't seem to like that. Why would I want to use -libjars over -file?
I also tried putting my mapper.py into the same jar as my library (in the hope of keeping them in the same directory), but I am still getting ImportErrors.
---
Here's the output with the -libjars flag:
---
java.lang.RuntimeException:
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:559)
at org.apache.hadoop.streaming.StreamJob.exitUsage(StreamJob.java:496)
at org.apache.hadoop.streaming.StreamJob.parseArgv(StreamJob.java:213)
at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:115)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Thanks,
Ryan
Ryan asked a question in Cloudera on November 04, 2009 09:11:
How to include third-party libraries in a MapReduce job?
What's the best way to include third-party dependencies in a streaming job?
I have a couple of library dependencies (Python Imaging Library and a custom library) for my (python) streaming job.
I've tried issuing a -file libs/ with no luck. I've also started looking into Maven.
Thanks,
Ryan
Ryan replied on November 02, 2009 22:41 to the question "Configuring the port numbers on OS X not working" in Cloudera:
Ryan asked a question in Cloudera on November 01, 2009 20:53:
Configuring the port numbers on OS X not working
I've finally gotten 0.18 running on my MacBook but am seeing some discrepancies.
My hadoop-site.xml contains:
...
<property>
<name>fs.default.name</name>
<value>hdfs://Ryans-MacBook.local:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>Ryans-MacBook.local:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
...
However, I can only access the NameNode and JobTracker at ports 50070 and 50030. Why is that? Also, when I use localhost rather than Ryans-MacBook.local, I get a bunch of connection errors when I run bin/start-all.sh. By the way, the window title of the NameNode page in the browser is set to "Hadoop Namenode 10.0.1.8:9000".
These are a few of the issues I've had which I can only guess are due to slight differences in OS X vs Linux.
Thanks,
Ryan
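A note on the port question: as far as I know, in Hadoop releases of this era fs.default.name and mapred.job.tracker set the RPC addresses that clients and daemons talk to, while the browser pages are served separately (dfs.http.address and mapred.job.tracker.http.address, defaulting to ports 50070 and 50030). So 9000/9001 and 50070/50030 can both be correct at once. The sketch below only reads back what a config file declares; `read_properties`, `split_host_port` and the sample XML are illustrative, with the values copied from the snippet above.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def split_host_port(address):
    """Split 'hdfs://host:9000' or 'host:9001' into (host, port)."""
    # Drop an optional scheme and any trailing path.
    authority = address.split("://", 1)[-1].split("/", 1)[0]
    host, sep, port = authority.rpartition(":")
    if not sep:
        return authority, None  # no explicit port given
    return host, int(port)


# Sample input assumed for illustration, mirroring the config in the post.
SAMPLE = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://Ryans-MacBook.local:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>Ryans-MacBook.local:9001</value>
  </property>
</configuration>
"""

for key, value in sorted(read_properties(SAMPLE).items()):
    print(key, split_host_port(value))
```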
