Work like clockwork
I have always been puzzled by tutorials and courses on ‘How to install X’: they do cover how to install X, but in many cases it is all about running X locally and manually, and the authors hardly mention getting X into production. This article assumes that: (1) you need to extract data from different sources in batches; (2) you have already done your research and selected Talend (in my case, Talend Open Studio) or Pentaho (in my case, Pentaho Data Integration); (3) you are moving your ETLs from development or test mode to production; (4) you use an Ubuntu server to run the production versions of Talend / Pentaho; (5) you have already installed the prerequisites, like Pentaho and Java, on the server; (6) you have set up all environment variables.
Set up the Cron Job
To run the ETL jobs on a schedule, I used a crontab, a file that lists the commands cron should run and the times at which to run them. Each entry follows a fixed format: five time fields (minute, hour, day of month, month, day of week) followed by the command, all separated by spaces or tabs. For example, I wanted to run my ETL script every day at 11:30 pm, so the cron entry was: 30 23 * * * command.
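To make the field layout concrete, here is a small sketch that splits the example entry into its parts. The variable names are my own and are only there for illustration; cron itself does not use them.

```shell
#!/bin/sh
# Split a crontab entry into its five time fields and the command.
entry='30 23 * * * command'

set -f            # turn off globbing so the * fields are not expanded to file names
set -- $entry     # split the entry on whitespace into $1, $2, ...
minute=$1
hour=$2
day_of_month=$3
month=$4
day_of_week=$5
shift 5           # whatever is left is the command to run
cmd="$*"
set +f            # turn globbing back on

printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s\n' \
    "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
printf 'command=%s\n' "$cmd"
```

A * in a field means "every value", so this entry fires at minute 30 of hour 23 on every day of every month.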
To run ETL jobs as a non-root user, type crontab -e.
Otherwise, use sudo crontab -e to edit root's crontab.
Schedule Pentaho with Crontab
In my case, Pentaho was installed in the /opt directory, at /opt/pentaho/data-integration,
while the Pentaho jobs and transformations were located in the home directory, at /home/ubuntu/repository/project/content_pdi.
Thus, the crontab looked like:
30 23 * * * /opt/pentaho/data-integration/kitchen.sh -file=/home/ubuntu/repository/project/content_pdi/project_1.kjb
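Cron runs the job with nobody watching the terminal, so it can help to send Kitchen's output to a file. Below is a minimal sketch that builds the same entry with output redirection added. The log path /home/ubuntu/logs/project_1.log is an assumption of mine, not part of my original setup; use any directory that exists and is writable by the cron user.

```shell
#!/bin/sh
# Paths taken from the article.
KITCHEN=/opt/pentaho/data-integration/kitchen.sh
JOB=/home/ubuntu/repository/project/content_pdi/project_1.kjb
# Hypothetical log file; the directory must already exist.
LOG=/home/ubuntu/logs/project_1.log

# ">> file 2>&1" appends both standard output and standard error to the log.
cron_line="30 23 * * * $KITCHEN -file=$JOB >> $LOG 2>&1"

# Print the entry so it can be pasted into the editor opened by crontab -e.
printf '%s\n' "$cron_line"
```

The script only prints the entry; it does not modify any crontab.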
Schedule Talend with Crontab
When building my Talend job, I extracted the executable files into the folder /home/ubuntu/repository/project/content_talend.
Thus, the crontab looked like:
30 23 * * * /bin/sh /home/ubuntu/repository/project/content_talend/project_1_run.sh
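If you want a record of whether each nightly run succeeded, one option is a small wrapper around the Talend launcher. This is a sketch under my own assumptions: the function name run_job, the wrapper itself, and the log format are hypothetical and not something Talend generates.

```shell
#!/bin/sh
# run_job SCRIPT LOGFILE
# Runs SCRIPT with /bin/sh, appends its output to LOGFILE, then appends a
# timestamped line with the exit code. Returns the exit code of SCRIPT.
run_job() {
    script=$1
    logfile=$2
    status=0
    /bin/sh "$script" >> "$logfile" 2>&1 || status=$?
    printf '%s exit=%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$script" >> "$logfile"
    return "$status"
}

# When the wrapper is called with two arguments (for example from cron),
# run the given script and log to the given file.
if [ "$#" -eq 2 ]; then
    run_job "$1" "$2"
fi
```

Saved as, say, /home/ubuntu/run_job.sh (a hypothetical location), the crontab entry would call the wrapper with the Talend script and a log file as its two arguments instead of calling project_1_run.sh directly. A non-zero exit= value in the log then points at a failed run.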
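Cron picks up a crontab saved through crontab -e automatically, so a reload is not normally needed. If you still want to restart the cron service, use: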
sudo service cron restart
And you are done! I hope this article helps you get a working prototype or move your ETL into production.