bisdw
-
A simple ETL tool
Version 0.2.0
2013-10-06
1 Introduction
Bisdw is a simple ETL tool that is target to extract data from different source. It can use different embedded ways to do the extraction from different kind of databases, files, etc. In addition Bisdw support FTP based transfer of the result of the extraction process. Bisdw use the open source project
Scriptellato enable ETL functionality, but can be extended in a simple way.
Bisdw runs as a daemon process. By configuration Bisdw will run different ETL jobs based on scheduling definition that are defined for each job.
2 Configuration
The main configuration file is the bidw.xml. Additional common properties are set in the properties.xml fle.
2.1 bisdw.xml configuration
Bisdw is configured by defining ETLs jobs in the in the bisdw.xml file.
1<?xml version="1.0" encoding="UTF-8"?>
2<bisdw xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../src/main/resources/bisdw.xsd">
3 <etljob>
4 <name>read_DB</name>
5 <desc>Database</desc>
6 <schedule>* */30 * * * ?</schedule>
7
8 <etlconfig name="read_db">
9 <desc>read db</desc>
10 <class>com.ingby.socbox.bisdw.etlprovider.ETLScriptella</class>
11 <order>10</order>
12 <property>
13 <key>configFile</key>
14 <value>read_db.xml</value>
15 </property>
16 </etlconfig>
17
18 <etlconfig name="send_ftp">
19 <desc>send_ftp</desc>
20 <class>com.ingby.socbox.bisdw.etlprovider.FTPSend</class>
21 <order>20</order>
22 <property>
23 <key>hostname</key>
24 <value>172.25.1.203</value>
25 </property>
26 <property>
27 <key>username</key>
28 <value>ftpuser</value>
29 </property>
30 <property>
31 <key>password</key>
32 <value>abcxyz</value>
33 </property>
34 <property>
35 <key>localDir</key>
36 <value>/tmp/shipment</value>
37 </property>
38 <property>
39 <key>remoteDir</key>
40 <value>ship</value>
41 </property>
42 </etlconfig>
43 </etljob>
44</bisdw>
Each <etljob> can include one to many <etlconfig>. A <etljob> will execute all etlconfig’s part of the job according to the cron expression in the <schedule> tag, but in the sequence order specified in the >order> tag. The etlconfig’s will be executed in the order of the lowest order first. This is important if there is a data dependencies between the different etlconfig’s. By using the order tag a sequence of execution is quarantined.
Each etlconfig has a class specification that define the java class to be executed by the etljob. A list of properties can be specified that is specific by the specified class.
2.1.1 Scheduling
Each etljob can have multiple schedule tags, but at least one. The scheduling can have two different formats, interval or cron based.
2.1.1.1 Interval scheduling
The simple format describe a interval execution that are repeated forever. The format is just a number and a indicator defining the granularity in seconds (S), minutes (M) or hours (H). 10M specify that the service should be executed every ten minutes.
1<schedule>10M</schedule>
2.1.1.2 Cron based scheduling
1<schedule>0 15 10 ? * MON-FRI</schedule>
2.2 Etlconfig configuration
2.2.1 ETLScriptella configuration
The ETLScriptella takes one property from the etlconfig definition in bisdw.xml. The property is the
configFile and specify the scriptella script file to read and execute. For more information about Scriptella configuration please visit
Scriptella documentation.
Below listing show a simple example of a scriptella configuration that select from a mysql database and create an xml file based on the result and write it to directory /tmp/shipment. Each file will be written to a unique file name using the etl.date.today macro.
1<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
2<etl>
3 <description>Example</description>
4 <properties>
5 filedepo=/tmp/shipment/
6 </properties>
7 <connection id="mydb" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/bisdwtest" user="testdb" password="testdb" />
8
9 <connection id="outfile" driver="text" url="$filedepo/shipment_${etl.date.today(’yyyyMMdd-HHmmss’)}.xml" >
10 null_string=
11 </connection>
12
13 <!--Writing header -->
14 <script connection-id="outfile">
15 <shipments>
16 </script>
17
18 <!-- Run query against db -->
19 <query connection-id="mydb">
20 <!-- Select all shipments with pckp_dt TODAY and inbnd_ind = O from shipmentdb-->
21 SELECT ship_id, no_pce, tot_wght FROM shipment;
22
23 <!-- For each row execute a script and write xml structure to out file -->
24 <script connection-id="outfile">
25 <![CDATA[
26 <shipment>
27 <id>$ship_id</id>
28 <weight>$tot_wght</weight>
29 <pieces>$no_pce</pieces>
30 </shipment>]]>
31 </script>
32 </query>
33
34 <!--Writing footer -->
35 <script connection-id="outfile">
36 </shipments>
37 </script>
38</etl>
39
2.2.2 FTPSend configuration
The FTPSend class enable transfer over FTP to a remote server. This can typical be used as part of the of a etljob where files are created by a scriptella script that is after creation is transfered to a FTP server.
The FTPSend configuration takes a number of properties to control the connection.
-
hostname - the name or IP of the ftp server to connect to.
-
port - the socket port used by the ftp server, default is 21.
-
timeout - connection timeout, default is 2000 ms.
-
username - username for the ftp server.
-
password - password for the username
-
transferMode - ascii or binary, default is ascii
-
connectionMode - active or passive, default is passive.
-
remoteDir - the directory on ftp server side, default is the current directory after login.
-
localDir - the directory on the local side.
-
moveFileAfterSend - move the files after transfer to the directory .save in the localDir default is true. If set to false the file is just deleted after successful transfer.
As and example see the listing in the
2.1 on page 1↑ that read any file located in the
/tmp/shipment directory.
2.3 properties.xml
The properties.xml include properties used by the core of Bisdw. The properties xml has a simple structure of key/value pair:
1<properties>
2 <property>
3 <key>akey</key>
4 <value>avalue</value>
5 </property>
6</properties>
The following properties are currently used by core Bisdw:
-
pidfile - the pid file for Bischeck, default is "/var/tmp/bisdw.pid".
3 Building Bisdw
To build Bisdw from source is simple. Check out the Bisdw trunk from gforge.ingby.com:
$ svn checkout --username anonymous http://gforge.ingby.com/svn/bischeck/bisdw/trunk bisdw
To build a Bisdw distribution run from the directory where you checked out the Bisdw code:
This will create a compressed tar file in the
target directory, named bisdw-x.y.z.tgz where x.y.z is the version number. Different versions of Bisdw can be checked out from the tags directory located in
http://gforge.ingby.com/svn/bischeck/bisdw/tags
3.1 Jar customization
To support custom jar files please place them in the directory customlib. This would typical be jdbc drivers, etc.
3.2 Developing with Bisdw
Its simple to develop your own ETL config implementation. To develop your own you must follow the interface ETLInf.
4 Installation
Download the distribution file and follow the steps below to install. Make sure you have root privileges doing this.
# tar xzvf bisdw-x.y.z.tgz
# cd bisdw-x.y.z
# chmod 755 install
# ./install -u #Get usage
# ./install #Install default
# service bisdw start #Redhat/Centos
# /etc/init.d/bischeck start #Debian/Ubuntu
To get full list of available options to the install script use -u. By default the install script will install Bisdw in directory /opt/socbox/addons/bisdw, referred to as $BISDW and with the ownership of the user id bisdw. Make sure that the user exist before running install.
The last commands start the bisdw daemon with the effective user id of the user id set during install, default user bisdw. The installation will configure bisdwd to start automatically in run level 3, 4 and 5.
The process id of the java process running bisdw in daemon mode is located in a file, default in /var/tmp/bisdw.pid. This file is used by the bisdwd script to stop the java program running Bisdw and make sure that only one instance of Bisdw is started on the server.
4.1 Getting started
In the $BISDW/etc directory there are examples of all the configuration files.
4.2 Logging
Bisdw use log4j for log management. The log4j configuration is described in the log4.properties file located in the resources directory of the Bisdw installation. By default Bisdw writes log information at level INFO to file /var/tmp/bischeck.log.
5 Run Bisdw from the command line
The normal way to run Bisdw is as a daemon using the init.d script bisdwd, but is also possible to start Bisdw in continues running mode by executing:
Running in this way have limitations since the execution will not automatically be placed as a background process and the effective user id will be the user starting the process which may not have all permissions according to the installation. Neither will pid files be updated correctly. For production system always use the init.d script.
$ sudo /etc/init.d/bisdwd start
or
For testing purpose it can be good to just run Bisdw once and make sure that every thing is executing as expected. This is done by executing:
To show the pid file used for the Bischeck daemon running:
$ bischeck ConfigurationManager -p
This command is used in the init script bischeckd to retrieve the current pid.
5.1 Validating configuration files
To validate if the xml configuration files are correct the following command will return 0 if correct. Use $? to see return status.
$ bischeck ConfigurationManager -v; echo $?
6 Releases
6.1 Release 0.2.0 - 2013-10-06
This is the first major version.
7 System requirements
Bisdw should run on any operating system that supports Java 6. The installation script and init scripts are supported on Redhat and Debian equivalent Linux distributions. Running on none Linux operating system has not been tested.
The following jar packages are distributed as part of the Bisdw distribution. All these packages have their own open source licenses.
All files are distributed as part of Bisdw are located in the lib directory.
8 Bisdw license
9 Bug reports and feature requests
10 Credits
Thanks to all people who has developed all the great software that Bisdw depends on.