The Cascading framework has its own workflow management system embedded in it, so when I searched online for how to launch a Cascading job from within the Apache Oozie workflow scheduler, I found a dearth of information.
In fact, when I asked on the oozie-users mailing list how to do it, the only response I got was a suggestion to write an Oozie extension to run Cascading jobs. That may be the right long-term solution (I don't know enough yet to say), but I did find a way to get it working with what Oozie provides today.
/*---[ Failed attempts ]---*/
I tried unsuccessfully to use the map-reduce action and the shell action. The map-reduce action won't work because it wants you to specify the Mapper and Reducer classes explicitly. That makes no sense for a Cascading job: you launch your main Cascading class and it auto-generates a bunch of mappers and reducers. And while you can use the oozie.launcher.action.main.class property to specify your main Cascading class, there seems to be no way to pass arguments to it.
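To make the mismatch concrete, here's roughly what the map-reduce action wants you to declare (the mapper and reducer class names below are hypothetical) -- and those are exactly the classes a Cascading flow generates internally and never exposes to you:

<!-- Hypothetical map-reduce action: Oozie wants concrete mapper/reducer
     classes, which a Cascading flow creates for you at runtime -->
<action name='mr-stage'>
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.mapper.class</name>
        <value>com.mycompany.MyMapper</value>   <!-- hypothetical -->
      </property>
      <property>
        <name>mapred.reducer.class</name>
        <value>com.mycompany.MyReducer</value>  <!-- hypothetical -->
      </property>
    </configuration>
  </map-reduce>
  <ok to="end" />
  <error to="fail" />
</action>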
I'm not sure why I couldn't get the shell action to work. I set the exec property to /usr/bin/hadoop, intending to run it as hadoop jar myjar.jar com.mycompany.MyClass arg1 arg2 argN, but several attempts to make that work failed. There probably is a way to make it work, however.
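For reference, my shell action attempt looked roughly like the sketch below. I'm reconstructing it from memory, so treat it as approximate rather than a known configuration:

<!-- Rough sketch of the shell action that failed for me: exec is
     /usr/bin/hadoop, with the jar/class/args as individual arguments -->
<action name='shell-stage'>
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>/usr/bin/hadoop</exec>
    <argument>jar</argument>
    <argument>myjar.jar</argument>
    <argument>com.mycompany.MyClass</argument>
    <argument>arg1</argument>
    <argument>arg2</argument>
    <!-- the jar presumably also needs to be shipped to the task node,
         e.g. via a <file> element; that may be part of why this failed -->
  </shell>
  <ok to="end" />
  <error to="fail" />
</action>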
/*---[ Solution: use the java action ]---*/
In order to launch Cascading jobs, we build an uber-jar (which Maven annoyingly calls a shaded jar) that bundles our Cascading code and supporting classes together with the Cascading library itself. But that's not enough, since all of that depends on the myriad Hadoop jars. We then use the hadoop jar invocation shown above, because it puts all the Hadoop jars on the classpath.
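For what it's worth, the shading is done with the standard maven-shade-plugin; a minimal setup looks something like the sketch below (add a version appropriate to your build). Marking the Hadoop dependencies as provided in the pom keeps them out of the shaded jar, which matters for the point that follows.

<!-- Minimal maven-shade-plugin setup: bundles our code and the Cascading
     library into one jar at package time; pin a <version> for your build -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>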
I didn't think the Oozie java action would work unless I built a massive uber-jar with all the Hadoop dependencies bundled in, which would then have to get farmed around the Hadoop cluster every time you run the job -- a great waste.
But I was happily surprised to notice that Oozie sets up the classpath for java (and map-reduce) tasks with all the Hadoop jars present.
So, here's the workflow.xml file that works:
<workflow-app xmlns='uri:oozie:workflow:0.2' name='cascading-wf'>
  <start to='stage1' />
  <action name='stage1'>
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>
      <main-class>com.mycompany.MyCascade</main-class>
      <java-opts></java-opts>
      <arg>/user/myuser/dir1/dir2</arg>
      <arg>my-arg-2</arg>
      <arg>my-arg-3</arg>
      <file>lib/${EXEC}#${EXEC}</file>
      <capture-output />
    </java>
    <ok to="end" />
    <error to="fail" />
  </action>
  <kill name="fail">
    <message>FAIL: Oh, the huge manatee!</message>
  </kill>
  <end name="end"/>
</workflow-app>
The parameterized variables, such as ${EXEC}, are defined in a job.properties file in the same directory as the workflow.xml file. The shaded jar sits in a lib subdirectory, as the <file> element indicates.
nameNode=hdfs://10.230.138.159:8020
jobTracker=http://10.230.138.159:50300
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/cascading
EXEC=mybig-shaded-0.0.1-SNAPSHOT.jar
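Once the workflow.xml and the lib directory are up on HDFS at the application path, you can submit the job with the standard Oozie CLI from wherever job.properties lives. The server URL below is Oozie's default port and just an assumption about your setup:

# submit and start the workflow; adjust the -oozie URL for your server
oozie job -oozie http://localhost:11000/oozie -config job.properties -run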
Let me know if you find another way to launch a Cascading job from Oozie or find any problems with this solution.