
Recent activity

  • question

    Abhijit asked a question in Cloudera on September 29, 2009 17:19:

    Abhijit
    Sqoop for data exports out of HDFS
    Can Sqoop be used to push HDFS data (or Hive tables) to a relational database?
    It makes total sense that the primary goal is to facilitate large data transfers out of databases. However, as the grid evolves around Hadoop, one of the pain points that has gotten the least attention is efficiently getting data out of HDFS. Many latency/SLA-dependent apps talk to a relational database, and it's nice to have Hadoop do all the crunching and then transfer a tiny to moderate amount of data to a relational database for downstream apps. And what could be better than having a single tool seamlessly do it for you? Any thoughts? Or is it already there? :p
  • question

    A comment on the question "Insufficient Instance Capacity error" in Cloudera:

    Abhijit
    Yeah, this one was on c1.medium. The workaround is to change the instance type, which pretty much means going for a higher-priced type. This happened to me quite often in the last 40 days (at least 4 times). Once it was so bad that I had to go all the way up to c1.xlarge to get a cluster of 5 nodes. – Abhijit, on September 23, 2009 17:28
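    The workaround described above (move to another instance type when one has no capacity) can be sketched as a simple fallback loop. This is only an illustration: `launch`, `InsufficientCapacity`, and `fake_launch` are hypothetical names made up for the sketch and are not part of hadoop-ec2 or boto, and the order of instance types is whatever the caller chooses.

```python
from typing import Callable, Sequence, Tuple


class InsufficientCapacity(Exception):
    """Hypothetical error: EC2 could not supply the requested instance type."""


def launch_with_fallback(
    launch: Callable[[str, int], str],
    instance_types: Sequence[str],
    num_nodes: int,
) -> Tuple[str, str]:
    """Try each instance type in order until one launch succeeds.

    `launch` is a caller-supplied function (an assumption, not a real
    hadoop-ec2 API) that starts `num_nodes` nodes of the given type and
    returns a cluster id, or raises InsufficientCapacity.
    """
    last_error = None
    for instance_type in instance_types:
        try:
            return instance_type, launch(instance_type, num_nodes)
        except InsufficientCapacity as err:
            last_error = err  # remember the failure and try the next type
    raise RuntimeError(
        "no capacity for any of: %s" % ", ".join(instance_types)
    ) from last_error


def fake_launch(instance_type: str, num_nodes: int) -> str:
    """Stand-in launcher for the example: only c1.xlarge has capacity."""
    if instance_type != "c1.xlarge":
        raise InsufficientCapacity(instance_type)
    return "cluster-%s-%d" % (instance_type, num_nodes)


chosen_type, cluster_id = launch_with_fallback(
    fake_launch, ["c1.medium", "m1.large", "c1.xlarge"], 5
)
```

    In real use, `launch` would wrap whatever actually starts the cluster and translate the provider's capacity error into `InsufficientCapacity`.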
  • question

    A comment on the question "Attaching additional nodes and volumes to EBS" in Cloudera:

    Abhijit
    Thanks Michael. I kind of solved this, or got around it. Basically, I manually edited the JSON files in ~/.hadoop-ec2. This gives you quite a bit of flexibility: you can change the number of slaves, the volumes per slave, and even move volumes around. – Abhijit, on September 21, 2009 19:01
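    A scripted version of that kind of manual edit might look like the sketch below. The thread does not show what the files in ~/.hadoop-ec2 actually contain, so the layout used here (a role name mapped to one list of volumes per instance) and the volume ids are purely assumptions for illustration; check the real file before adapting it.

```python
import json

# Hypothetical layout (an assumption, not taken from the thread):
# role -> one list of volumes per instance.
EXAMPLE = """
{
  "master": [[{"device": "/dev/sdj", "volume_id": "vol-aaaa0001"}]],
  "slave": [
    [{"device": "/dev/sdj", "volume_id": "vol-bbbb0001"}],
    [{"device": "/dev/sdj", "volume_id": "vol-bbbb0002"}]
  ]
}
"""


def add_slave(storage: dict, volumes: list) -> dict:
    """Return a copy of `storage` with one more slave entry holding `volumes`."""
    updated = json.loads(json.dumps(storage))  # deep copy via a JSON round trip
    updated.setdefault("slave", []).append(volumes)
    return updated


storage = json.loads(EXAMPLE)
bigger = add_slave(storage, [{"device": "/dev/sdj", "volume_id": "vol-cccc0001"}])
print(json.dumps(bigger, indent=2))
```

    Working on a copy and keeping a backup of the original file seems prudent when the volumes hold data you cannot afford to lose.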
  • question

    Abhijit asked a question in Cloudera on September 04, 2009 18:49:

    Abhijit
    hive 0.4? 0.3 has some serious issues....
    I think the hadoop-hive RPMs come with Hive 0.3. I installed the Hive RPMs and used them for a while, but very soon I got stalled by bugs in 0.3, and these were very simple and frequent queries. I got to know from the Facebook folks that version 0.4 is stable and has lots of bug fixes and improvements, so I manually upgraded to 0.4. I have been using it for a while now and it looks pretty good so far.

    So the question is: do you guys already have RPMs with Hive 0.4? If not, do you have any plans or timelines?
  • problem

    A comment on the problem "multifilesplit is using job default filesystem incorrectly" in Cloudera:

    Abhijit
    Thanks Alex. That totally makes sense if you guys are close on 0.20.
    We can use a workaround for this. This is what I did, and it worked:
    - Copied and renamed MultiFileSplit.java into our code base
    - Applied the trivial patch given in http://issues.apache.org/jira/browse/...
    - You also need to copy and rename MultiFileInputFormat.java, and replace references to MultiFileSplit with your patched class name
    - Jar these files and supply them with the -libjars option – Abhijit, on September 03, 2009 20:25
  • problem

    Abhijit reported a problem in Cloudera on September 03, 2009 19:21:

    Abhijit
    multifilesplit is using job default filesystem incorrectly
    We were trying to use MultiFileInputFormat, so we extended the class and implemented MultiFileTextInputFormat. But then we ran into the following exception:
    java.lang.IllegalArgumentException: Wrong FS: s3n://myfolder, expected: hdfs://ec2-174-129-165-26.compute-1.amazonaws.com
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:309)
    at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
    at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
    at org.apache.hadoop.mapred.MultiFileSplit.getLocations(MultiFileSplit.java:96)
    at org.apache.hadoop.mapred.JobClient.writeSplitsFile(JobClient.java:872)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:768)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
    at com.rocketfuelinc.etl.AdLogETLDriver.run(AdLogETLDriver.java:77)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at com.rocketfuelinc.etl.AdLogETLDriver.main(AdLogETLDriver.java:154)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

    After drilling down, it looks like a problem in Hadoop's MultiFileSplit class that is patched in version 0.19: http://issues.apache.org/jira/browse/...

    * long term -- can we get Amazon AMI patched with this?
    * short term -- can we pass patched class jar with -libjars option to hadoop jobs?
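    Judging from the stack trace, MultiFileSplit.getLocations ends up asking the job's default filesystem (HDFS) about an s3n:// path, and the path check rejects it. The toy model below is not Hadoop code; it only mimics that scheme/authority check and shows the general remedy of resolving the filesystem from the path itself. Whether the JIRA patch does exactly this is an assumption, since the link above is truncated, and `check_path`, `filesystem_for`, and the hostnames are names invented for the sketch.

```python
from urllib.parse import urlparse


def check_path(fs_uri: str, path: str) -> None:
    """Raise if `path` names a different filesystem than `fs_uri`.

    Toy model of the check behind the "Wrong FS" message; not Hadoop's code.
    """
    fs, p = urlparse(fs_uri), urlparse(path)
    if p.scheme and (p.scheme, p.netloc) != (fs.scheme, fs.netloc):
        raise ValueError("Wrong FS: %s, expected: %s" % (path, fs_uri))


def filesystem_for(path: str, default_fs: str) -> str:
    """Pick the filesystem from the path itself, falling back to the default."""
    p = urlparse(path)
    return "%s://%s" % (p.scheme, p.netloc) if p.scheme else default_fs


DEFAULT_FS = "hdfs://namenode.example.com"  # hypothetical job default
INPUT_PATH = "s3n://myfolder/logs"          # hypothetical input path

# Asking the default filesystem about the S3 path fails ...
try:
    check_path(DEFAULT_FS, INPUT_PATH)
except ValueError as err:
    print(err)

# ... while resolving the filesystem from the path passes the check.
check_path(filesystem_for(INPUT_PATH, DEFAULT_FS), INPUT_PATH)
```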
  • question

    Abhijit asked a question in Cloudera on August 21, 2009 18:29:

    Abhijit
    Attaching additional nodes and volumes to EBS
    I want to add more nodes and volumes to an existing EBS set-up. There is no good documentation available on the site, and I have a month's worth of data on EBS, so I don't want to screw this up. This is what I have and what I am planning to do.
    - CLUSTER=my-ebs-cluster
    - SPEC-FILE=my-ebs-cluster-storage-spec.json. Has two mounts for master and two for slave ROLE
    - Current Storage (for example): 2 volumes for master and 6 volumes (3x2) for 3 slaves
    - Want to add 2 more slaves with 2 volumes each (i.e. 5 slaves and 10 volumes)
    Steps:
    1. hadoop-ec2 create-storage my-ebs-cluster slave 2 my-ebs-cluster-storage-spec.json
    2. hadoop-ec2 launch-cluster my-ebs-cluster 5

    Is that it?
    - Do I need to use the attach-storage option?
    - Why doesn't attach-storage take a volume id or something? How does it work?
  • star

    Abhijit marked one of Alex Loddengaard's replies in Cloudera as useful. Alex Loddengaard replied to the question "getting status of launched cluster".

  • question

    A comment on the question "getting status of launched cluster" in Cloudera:

    Abhijit
    Thanks much Alex. Gosh I might have used that cmd several times before. The only excuse I have is that you guys are spoiling us with your great help :p:-) – Abhijit, on August 20, 2009 16:47
  • star

    Abhijit marked one of Tom's replies in Cloudera as useful. Tom replied to the question "ec2 ebs configuration for various cases".

  • question

    A comment on the question "ec2 ebs configuration for various cases" in Cloudera:

    Abhijit
    Thanks Alex and Tom. I think in the next few days we will have an even better idea about our workload and the different things people want to do. I will post any questions or interesting findings. – Abhijit, on August 20, 2009 16:44
  • question

    Abhijit asked a question in Cloudera on August 18, 2009 01:38:

    Abhijit
    getting status of launched cluster
    After we launch a cluster using the hadoop-ec2 command, is there a way to know whether the cluster is running any jobs or is sitting idle?

    We have a hive-client which launches an EC2 cluster when a Hive user creates a session and shuts it down when they delete the session. We are trying to make sure we shut down the cluster after a reasonable timeout even if the user forgot to delete the session.
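    The timeout logic described here can be sketched independently of how the job count is obtained. In the minimal sketch below, `IdleWatchdog` is a hypothetical helper, not part of hadoop-ec2 or Hive; the caller polls the cluster by whatever means it has and passes in the number of running jobs.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Tracks how long a cluster has been idle (hypothetical helper)."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_busy = clock()  # treat the launch time as the last busy time

    def observe(self, running_jobs: int) -> bool:
        """Record one poll; return True when the cluster should be shut down."""
        now = self.clock()
        if running_jobs > 0:
            self.last_busy = now  # any running job resets the idle timer
            return False
        return now - self.last_busy >= self.timeout_seconds


# Example with a fake clock so the behaviour is easy to follow.
fake_now = [0.0]
watchdog = IdleWatchdog(timeout_seconds=600, clock=lambda: fake_now[0])
decisions = []
for t, jobs in [(0, 2), (300, 0), (500, 1), (1000, 0), (1100, 0)]:
    fake_now[0] = float(t)
    decisions.append(watchdog.observe(jobs))
```

    With a 600-second timeout, only the last poll (600 seconds after the last running job was seen) signals a shutdown.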
  • question

    Abhijit asked a question in Cloudera on August 18, 2009 01:29:

    Abhijit
    ec2 ebs configuration for various cases
    We use the EC2 EBS distro. The way I have set it up is very much like the tutorial on the EBS page: two EBS volumes each for master and slave in the JSON file, with storage created for 1 master and 3 slaves. To add, we typically bring the cluster up, do work for 5-10 hours, and shut it down.

    I am trying to understand how this works in the following cases:
    (a) We want to launch a cluster with more than 3 slaves, say 10, once in a while
    (b) We want to have more disk space in HDFS
    (c) We want to start multiple clusters with different node capacities

    It would be nice to know the best way to configure things to address these cases smoothly...
  • star

    Abhijit marked one of Aaron's replies in Cloudera as useful. Aaron replied to the question "Insufficient Instance Capacity error".

  • question

    Abhijit asked a question in Cloudera on August 12, 2009 18:57:

    Abhijit
    Insufficient Instance Capacity error
    We are using the Cloudera Distribution AMI for Hadoop with EBS integration. The set-up was working fine for the last few days. Today when I tried to start the cluster I got an InsufficientInstanceCapacity error. It seems like the problem is coming from the Boto server and might have to do with EC2 as such. I just want to make sure, and to know if anyone has any thoughts on this.

    Traceback (most recent call last):
    File "/home/me/apps/hadoop-ec2/hadoop-ec2", line 150, in <module>
    opt.get('env'))
    File "/home/me/cloudera-for-hadoop-on-ec2-py-0.2.0-beta/hadoop/ec2/commands.py", line 69, in launch_master
    return mkarg(os.path.join(head, x))
    File "/home/me/cloudera-for-hadoop-on-ec2-py-0.2.0-beta/hadoop/ec2/cluster.py", line 168, in launch_instances
    File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 353, in run_instances
    return self.get_object('RunInstances', params, Reservation, verb='POST')
    File "/usr/lib/python2.6/site-packages/boto/connection.py", line 569, in get_object
    response = self.make_request(action, params, path, verb)
    File "/usr/lib/python2.6/site-packages/boto/connection.py", line 540, in make_request
    return self._mexe(verb, qs, request_body, headers)
    File "/usr/lib/python2.6/site-packages/boto/connection.py", line 396, in _mexe
    raise BotoServerError(response.status, response.reason, body)
    boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
    <?xml version="1.0"?>
    <response><errors><error>InsufficientInstanceCapacity<message>Insufficient capacity.</message></error></errors><requestid>c99b3591-95bd-4bd9-84a3-987d884af214</requestid></response>
  • star

    Abhijit marked one of Tom's replies in Cloudera as useful. Tom replied to the question "using s3 as a replacement to hdfs for EC2 jobs".

  • question

    A comment on the question "using s3 as a replacement to hdfs for EC2 jobs" in Cloudera:

    Abhijit
    By the way, I was referring to the S3 block filesystem mentioned on http://wiki.apache.org/hadoop/AmazonS3
    and to using it as a replacement for HDFS, as described in one of the sections on that page – Abhijit, on August 04, 2009 01:31