Friday, October 11, 2013

Launching a Cascading job from Apache Oozie

The Cascading framework has its own workflow management system embedded in it, so when I searched online for how to launch a Cascading job from within the Apache Oozie workflow scheduler, I found very little.

In fact, when I asked on the oozie-users mailing list how to do it, the only response I got back was to write an Oozie extension to run Cascading jobs. That may be the right solution long term (I don't know enough yet to say), but I did find a way to get it working with what Oozie provides today.


/*---[ Failed attempts ]---*/

I tried unsuccessfully to use the map-reduce action and the shell action. The former won't work for two reasons. First, it wants you to specify the Mapper and Reducer classes explicitly, which doesn't make sense for a Cascading job: you launch your main Cascading class and it auto-generates a bunch of mappers and reducers. Second, while you can use the oozie.launcher.action.main.class property to specify your main Cascading class, there seems to be no way to pass arguments to it.

I'm not sure why I couldn't get the shell action to work. I made the exec property /usr/bin/hadoop in order to run it as hadoop jar myjar.jar com.mycompany.MyClass arg1 arg2 argN, but several attempts to make that work failed. There probably is a way to make it work, however.


/*---[ Solution: use the java action ]---*/

In order to launch Cascading jobs, we build an uber-jar (which Maven annoyingly calls a shaded jar) that bundles our specific Cascading code and supporting objects together with the Cascading library. But that's not enough, because all of that depends on the myriad Hadoop jars. So outside of Oozie we launch it with the hadoop jar invocation I indicated above, because it puts all the Hadoop jars in the classpath.

I didn't think using the Oozie java action would work unless I built a massive uber-jar with all the Hadoop dependencies, which would then have to get farmed around the Hadoop cluster each time you run it -- a great waste.

But I was happily surprised to notice that Oozie sets up the classpath for java (and map-reduce) tasks with all the Hadoop jars present.
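One way to check this on your own cluster is to temporarily point the java action's main-class at a small diagnostic class that prints the Hadoop jars it finds on its classpath. The following is a minimal sketch and not part of my actual setup: the class name ClasspathCheck is hypothetical, and it assumes the Hadoop jars have file names starting with hadoop- (as hadoop-core and hadoop-common do), which may not hold for every distribution.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic class: lists classpath entries that look like Hadoop jars.
public class ClasspathCheck {

    // Returns the classpath entries whose file name starts with "hadoop-".
    // ASSUMPTION: Hadoop jars follow the hadoop-*.jar naming pattern.
    static List<String> hadoopEntries(String classpath) {
        List<String> hits = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            String name = new File(entry).getName();
            if (name.startsWith("hadoop-")) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // java.class.path holds the classpath the JVM was started with.
        String classpath = System.getProperty("java.class.path");
        for (String entry : hadoopEntries(classpath)) {
            System.out.println(entry);
        }
    }
}
```

If the classpath is set up the way I describe, the output (which should end up in the stdout log of the task that Oozie launches for the action) would list the cluster's Hadoop jars even though none of them are bundled in your own jar.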

So, here's the workflow.xml file that works:

<workflow-app xmlns='uri:oozie:workflow:0.2' name='cascading-wf'>
  <start to='stage1' />
  <action name='stage1'>
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>

      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>

      <main-class>com.mycompany.MyCascade</main-class>
      <java-opts></java-opts>
      <arg>/user/myuser/dir1/dir2</arg>
      <arg>my-arg-2</arg>
      <arg>my-arg-3</arg>
      <file>lib/${EXEC}#${EXEC}</file> 
      <capture-output />
    </java>
    <ok to="end" />
    <error to="fail" />
  </action>


  <kill name="fail">
    <message>FAIL: Oh, the huge manatee!</message>
  </kill>

  <end name="end"/>
</workflow-app>

The parameterized variables, such as ${EXEC}, are defined in a job.properties file in the same directory as the workflow.xml file. The shaded jar is in a lib subdirectory, as the file element indicates.


nameNode=hdfs://10.230.138.159:8020
jobTracker=http://10.230.138.159:50300

queueName=default

oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/cascading
EXEC=mybig-shaded-0.0.1-SNAPSHOT.jar
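
On the Java side, the arg elements in the workflow arrive in order as the String[] passed to main, and a java action that declares capture-output is expected to write a Java properties file to the path Oozie supplies in the oozie.action.output.properties system property. The skeleton below is only a sketch of that contract, not my actual com.mycompany.MyCascade class: the class name MyCascadeSkeleton, the argument count check, and the inputDir key are all made up for illustration, and the Cascading code itself is omitted.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical skeleton of a main class launched by the Oozie java action.
public class MyCascadeSkeleton {

    // System property Oozie sets to the file it reads for <capture-output/>.
    static final String OUTPUT_PROP = "oozie.action.output.properties";

    // Writes the given values where Oozie expects captured output, if requested.
    static void writeCapturedOutput(Properties values) throws IOException {
        String path = System.getProperty(OUTPUT_PROP);
        if (path == null) {
            return; // property not set, e.g. when run outside Oozie
        }
        try (OutputStream out = new FileOutputStream(new File(path))) {
            values.store(out, null);
        }
    }

    public static void main(String[] args) throws IOException {
        // ASSUMPTION: three arguments, matching the three <arg> elements above.
        if (args.length != 3) {
            // An uncaught exception makes the action take its error transition.
            throw new IllegalArgumentException("expected 3 arguments, got " + args.length);
        }
        String inputDir = args[0]; // value of the first <arg>

        // ... build and run the Cascading flow or cascade here (omitted) ...

        Properties result = new Properties();
        result.setProperty("inputDir", inputDir); // illustrative key only
        writeCapturedOutput(result);
    }
}
```

Note that the class signals failure by throwing rather than by calling System.exit, since the Oozie documentation says a java action's main class must not call System.exit.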

Let me know if you find another way to launch a Cascading job from Oozie or find any problems with this solution.

1 comment:

  1. Somewhat more detailed post on how to submit Cascading jobs on Oozie, featuring Gradle Shadow plugin:
    http://pannoniancoder.blogspot.com/2014/06/running-cascading-hadoop-jobs-via-cli.html
