Spring Batch in Action: A Sample Application

By Arnaud Cogoluegnes, Thierry Templier, Gary Gregory, Olivier Bazoud, authors of Spring Batch in Action

Batch applications are a challenge to write, and that's why Spring Batch was created: to make them easier to write but also faster, more robust, and reliable. Batch applications process large amounts of data without human intervention. You'd opt to use batch applications to compute data for generating monthly financial statements, calculating statistics, and indexing files. This article, based on chapter 1 of Spring Batch in Action, implements a real-world Spring Batch job. By the end of this article, you'll have an overview of what Spring Batch does, and you'll be ready to implement your first job with Spring Batch. Let's get started with batch applications!

This article introduces a real application that illustrates the use of Spring Batch: an online store application. This use case starts out small and simple but remains realistic in terms of technical requirements. It not only demonstrates Spring Batch features but also illustrates how this use case fits into the enterprise landscape.

By implementing this use case using Spring Batch, you gain a practical understanding of the framework: how it implements efficient reading and writing of large volumes of data, when to use built-in components, when to implement your own components, how to configure a batch job with the Spring lightweight container, and much more. By the end of this article, you'll have a good overview of how Spring Batch works.

The online store application

The ACME Corporation wants to expand its business by selling its products on the web. To do so, ACME chooses to build a dedicated online store application. ACME will use batch jobs to populate the online store database with the catalog from its internal proprietary system, as shown in figure 1. The system will process data every night to insert new products in the catalog or update existing products.

Figure 1 Thanks to this new application, anyone can buy ACME's products online. The system sends catalogs to a server where a batch process reads them and writes product records into the online store database.

That's it for the big picture, but you should understand why ACME decided to build an online store in the first place and populate it using batch processes.

Why build an online store with batch jobs?

Why did ACME choose to build an online, web-based application to sell its products? As we mentioned, this is the best way for ACME to expand its business and to serve more customers. Web applications are easy to deploy, easy to access, and can provide a great user experience. ACME plans to deploy the online store application to a local web hosting provider rather than hosting it on its own network. The first version of the online store will provide a simple but efficient UI; ACME focuses on the catalog and transactions first, before providing more features and a more elaborate UI.

Next, why did ACME choose to shuttle data from one system to the other instead of making its onsite catalog and the online store communicate directly? The software that powers the catalog has an API, so why not use it? The main reason is security: as illustrated in figure 2, ACME's own network hosts the catalog system, and the company doesn't want to expose the catalog system to the outside world directly, even via another application. This precaution is rather drastic, but that's how things are done at ACME.

Figure 2 Because ACME doesn't want its internal catalog system to be directly accessible from the outside world, it doesn't allow the two applications to communicate directly and exchange data.

Another reason for this architecture is that the catalog system's API and data format don't suit the needs of the online store application: ACME wants to show a summarized view of the catalog data to customers without overwhelming them with a complex catalog structure and supplying too many details. You could get this summarized catalog view by using the catalog system's API, but you'd need to make many calls, which would cause performance to suffer in the catalog system.

To summarize, a mismatch exists between the view of the data provided by the catalog system and the view of the data required by the online store application. Therefore, an application needs to process the data before exposing it to customers through the online store.

Why use batch processes?

The online store application scenario is a good example of two systems communicating to exchange data. ACME updates the catalog system throughout the day, adding new products and updating existing products. The online store application doesn't need to expose live data because buyers can live with day-old catalog information.

Figure 3 An extract, transform, and load (ETL) process extracts and transforms the catalog system data into a flat file, which ACME sends every night to a Spring Batch process. This Spring Batch process is in charge of reading the flat file and importing the data into the online store database.

Therefore, a nightly batch process updates the online store database, using flat files, as shown in figure 3.

Extract, transform, and load (ETL)

Briefly stated, ETL is a process in the database and data-warehousing world that performs the following steps:

  1. Extracts data from an external data source
  2. Transforms the extracted data to match a specific purpose
  3. Loads the transformed data into a data target: a database or data warehouse

Many products, both free and commercial, can help create ETL processes. This is a bigger topic than we can address here, but it isn't always as simple as these three steps. Writing an ETL process can present its own set of challenges involving parallel processing, rerunnability, and recoverability. The ETL community has developed its own set of best practices to meet these and other requirements.

In figure 3, an ETL process creates the flat file to populate the online store database. It extracts data from the catalog system and transforms it to produce the view expected by the online store application. For the purpose of our discussion, this ETL process is a black box: it could be implemented with an ETL tool (like Talend) or even with another Spring Batch job. We focus next on how the online store application reads and writes the catalog's product information.

The import product use case

The online store application sells products out of a catalog, making the product a main domain concept. The import product batch reads the product records from a flat file created by ACME and updates the online store application database accordingly. Figure 4 illustrates that reading and writing products is at the core of this batch job, but it contains other steps as well.

Figure 4 The Spring Batch job consists of the following steps: decompression and read-write.

The read-write step forms the core of the batch job, but as figure 4 shows, this isn't the only step. This batch job consists of the following steps:

  1. Decompression—Decompresses the archive flat file received from the ACME network. The file is compressed to speed up transfer over the internet.
  2. Reading and writing—The flat file is read line by line, and the product records are written to the database.

This batch process allows us to introduce the Spring Batch features displayed in table 1.

Table 1 Spring Batch features introduced by the import catalog job

Batch process step / Spring Batch feature

Decompression
  • Custom processing in a job (but not reading from a data store and writing to another)

Read-write
  • Reading a flat file
  • Implementing a custom database writing component
  • Skipping invalid records instead of failing the whole process

Configuration
  • Leveraging Spring's lightweight container and Spring Batch's namespace to wire up batch components
  • Using the Spring Expression Language to make the configuration more flexible

Rather than describe each of Spring Batch's features in the order in which they appear as batch job steps, we start with the core of the process: reading and writing the products. Then we see how to decompress the incoming file before making the process more robust by validating the input parameters and choosing to skip invalid records to avoid the whole job failing on a single error.

Reading and writing the product data

Reading and writing the product catalog is at the core of the Spring Batch job. ACME provides the product catalog as a flat file, which the job needs to import into the online store database. Reading and writing is Spring Batch's sweet spot: for the import product job, you only have to configure one Spring Batch component to read the content of the flat file, implement a simple interface for the writing component, and create a configuration file to handle the batch execution flow. Table 1 lists the Spring Batch features introduced by the import catalog job. Let's start by using Spring Batch to implement the read-write use case.

Anatomy of the read-write step

Because read-write (and copy) scenarios are common in batch applications, Spring Batch provides specific support for this use case. Spring Batch includes many ready-to-use components to read from and write to data stores like files and databases.

Figure 5 In read-write scenarios, Spring Batch uses chunk processing. Spring Batch reads items one by one from an ItemReader, collects the items in a chunk of a given size, and sends that chunk to an ItemWriter.

Spring Batch also includes a batch-oriented algorithm to handle the execution flow, called chunk processing. Figure 5 illustrates the principle of chunk processing.

Spring Batch handles read-write scenarios by managing an ItemReader and an ItemWriter. Using chunk processing, Spring Batch collects items one at a time from the item reader into a configurable-sized chunk. Spring Batch then sends the chunk to the item writer and goes back to using the item reader to create another chunk, and so on, until the input is exhausted.
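
To make this flow concrete, here's a minimal sketch of that loop in plain Java. It isn't the framework's actual implementation (Spring Batch also wraps each chunk in a transaction and records execution metadata), and the ChunkLoopSketch class and commitInterval parameter exist only for this illustration:

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ChunkLoopSketch {

  // Simplified sketch of chunk processing; the real framework adds
  // transactions, restartability, and error handling around this loop
  public static <T> void copy(ItemReader<T> reader, ItemWriter<T> writer,
      int commitInterval) throws Exception {
    List<T> chunk = new ArrayList<T>();
    T item;
    while ((item = reader.read()) != null) {   // read items one by one
      chunk.add(item);
      if (chunk.size() == commitInterval) {    // chunk is full
        writer.write(chunk);                   // write the whole chunk at once
        chunk.clear();
      }
    }
    if (!chunk.isEmpty()) {
      writer.write(chunk);                     // last, possibly smaller, chunk
    }
  }
}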

CHUNK PROCESSING Chunk processing is particularly well suited to handle large data operations because a job handles items in small chunks instead of processing them all at once. Practically speaking, a large file won't be loaded in memory; instead it's streamed, which is more efficient in terms of memory consumption. Chunk processing allows more flexibility to manage the data flow in a job. Spring Batch also handles transactions and errors around read and write operations.

Spring Batch provides an optional processing step in chunk processing: a job can process (transform) read items before sending them to the ItemWriter. The ability to process an item is useful when you don't want to write an item as is. The component that handles this transformation is an implementation of the ItemProcessor interface. Because item processing in Spring Batch is optional, the illustration of chunk processing shown in figure 5 is still valid. Figure 6 illustrates chunk processing combined with item processing.

Figure 6 Chunk processing combined with item processing: an item processor can transform input items before calling the item writer.

What can you do in an ItemProcessor? You can perform any transformations you need on an item before Spring Batch sends it to the ItemWriter. This is where you implement the logic to transform the data from the input format into the format expected by the target system. Spring Batch also lets you validate and filter input items. If you return null from the ItemProcessor method process, processing for that item stops and Spring Batch won't insert the item in the database.
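
As an aside, here's what a minimal ItemProcessor could look like: a hypothetical TrimmingItemProcessor (not part of the sample application) that trims String items and filters out blank ones by returning null:

import org.springframework.batch.item.ItemProcessor;

public class TrimmingItemProcessor implements ItemProcessor<String, String> {

  public String process(String item) throws Exception {
    String trimmed = item.trim();
    // returning null filters the item out: Spring Batch won't send it to the writer
    return trimmed.length() == 0 ? null : trimmed;
  }
}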

NOTE Our read-write use case doesn't have an item-processing step.

The following listing shows the definition of the chunk-processing interfaces ItemReader, ItemProcessor, and ItemWriter.

Listing 1 Spring Batch interfaces for chunk processing

package org.springframework.batch.item;

public interface ItemReader<T> {

  T read() throws Exception, UnexpectedInputException,
    ParseException, NonTransientResourceException;

}

package org.springframework.batch.item;

public interface ItemProcessor<I, O> {

  O process(I item) throws Exception;

}

package org.springframework.batch.item;

import java.util.List;

public interface ItemWriter<T> {

  void write(List<? extends T> items) throws Exception;

}

The next two subsections show how to configure the Spring Batch flat file ItemReader and how to write your own ItemWriter to handle writing products to the database.

Reading a flat file

Spring Batch provides the FlatFileItemReader class to read records from a flat file. To use a FlatFileItemReader, you need to configure some Spring beans and implement a component that creates domain objects from what the FlatFileItemReader reads; Spring Batch will handle the rest. You can kiss all your old boilerplate I/O code goodbye and focus on your data.

THE FLAT FILE FORMAT

The input flat file format consists of a header line and one line per product record. Here's an excerpt:


PRODUCT_ID,NAME,DESCRIPTION,PRICE 
PR....210,BlackBerry 8100 Pearl,A cell phone,124.60
PR....211,Sony Ericsson W810i,Yet another cell phone!,139.45
PR....212,Samsung MM-A900M Ace,A cell phone,97.80
PR....213,Toshiba M285-E 14,A cell phone,166.20
PR....214,Nokia 2610 Phone,A cell phone,145.50

You may recognize this as the classic comma-separated value (CSV) format. There's nothing out of the ordinary in this flat file: for a given row, the format separates each column value from the next with a comma. Spring Batch maps each row in the flat file to a Product domain object.

THE PRODUCT DOMAIN CLASS

The Product class maps the different columns of the flat file. Note the instance variable declarations for product attributes like id, name, price, and so on, in this snippet; the getter and setter methods are excluded for brevity:


package com.manning.sbia.ch01.domain;

import java.math.BigDecimal;

public class Product {

  private String id;
  private String name;
  private String description;
  private BigDecimal price;

  (...)

}
NOTE We use a BigDecimal for the product price because the Java float and double primitive types aren't well suited for monetary calculations. For example, it's impossible to exactly represent 0.1.
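
The following throwaway snippet (not part of the job) shows the difference:

import java.math.BigDecimal;

public class MoneyDemo {

  public static void main(String[] args) {
    // binary floating point cannot represent 0.1 exactly
    System.out.println(0.1 + 0.2);                                          // prints 0.30000000000000004
    // a BigDecimal built from a String keeps the decimal value exact
    System.out.println(new BigDecimal("0.1").add(new BigDecimal("0.2")));   // prints 0.3
  }
}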

Let's now use the FlatFileItemReader to create Product objects out of the flat file.

CREATING DOMAIN OBJECTS WITH A FLATFILEITEMREADER

The FlatFileItemReader class handles all the I/O for you: opening the file, streaming it by reading each line, and closing it. The FlatFileItemReader class delegates the mapping between an input line and a domain object to an implementation of the LineMapper interface. Spring Batch provides a handy LineMapper implementation called DefaultLineMapper, which delegates the mapping to other strategy interfaces. Figure 7 shows all of this delegation work.

Figure 7 The FlatFileItemReader reads the flat file and delegates the mapping between a line and a domain object to a LineMapper. The LineMapper implementation delegates the splitting of lines and the mapping between split lines and domain objects.

That's a lot of delegation, and it means you'll have more to configure, but such is the price of reusability and flexibility. You'll be able to configure and use built-in Spring Batch components or provide your own implementations for more specific tasks.

The DefaultLineMapper is a typical example; it needs a LineTokenizer to split each line into fields and a FieldSetMapper to transform those fields into a domain object.

You'll soon see the whole Spring configuration in listing 2 (LineTokenizer is of particular interest), but next we focus on the FieldSetMapper implementation to create Product domain objects.

IMPLEMENTING A FIELDSETMAPPER FOR PRODUCT OBJECTS

You use a FieldSetMapper to convert the line split by the LineTokenizer into a domain object. The FieldSetMapper interface is straightforward:

public interface FieldSetMapper<T> {
  T mapFieldSet(FieldSet fieldSet) throws BindException;
}

The FieldSet parameter comes from the LineTokenizer. Think of it as an equivalent to the JDBC ResultSet: it retrieves field values and performs conversions between String objects and richer objects like BigDecimal. The following snippet shows the ProductFieldSetMapper implementation:


package com.manning.sbia.ch01.batch;

import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.validation.BindException;

import com.manning.sbia.ch01.domain.Product;

public class ProductFieldSetMapper implements FieldSetMapper<Product> {

  public Product mapFieldSet(FieldSet fieldSet) throws BindException {   
    Product product = new Product(); 
    product.setId(fieldSet.readString("PRODUCT_ID")); 
    product.setName(fieldSet.readString("NAME"));   
    product.setDescription(fieldSet.readString("DESCRIPTION")); 
    product.setPrice(fieldSet.readBigDecimal("PRICE"));
    return product;
  }
}

The ProductFieldSetMapper implementation isn't rocket science, and that's exactly the point: it focuses on retrieving the data from the flat file and converting the values into Product domain objects. We leave Spring Batch to deal with all of the I/O plumbing and with reading the flat file efficiently. Notice in the mapFieldSet method the String literals PRODUCT_ID, NAME, DESCRIPTION, and PRICE. Where do these references come from? They're part of the LineTokenizer configuration, so let's study the Spring configuration for the FlatFileItemReader.

CONFIGURATION OF THE FLATFILEITEMREADER

The FlatFileItemReader can be configured like any Spring bean using an XML configuration file, as shown in the following listing.

Listing 2 Spring configuration of the FlatFileItemReader

<bean id="reader" 
      class="org.springframework.batch.item.file.FlatFileItemReader">
  <property name="resource"
            value="file:./work/output/output.txt" />
  <property name="linesToSkip" value="1" /> 
<property name="lineMapper"> <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper"> <property name="lineTokenizer"> <bean class="org.springframework.batch.item.file.transform.
[CA] DelimitedLineTokenizer"> &l;property name="names" value="PRODUCT_ID, [CA] NAME,DESCRIPTION,PRICE" /> </bean> </property> <property name="fieldSetMapper"> <bean class="com.manning.sbia.ch01.batch. [CA] ProductFieldSetMapper" /> </property> </bean> </property> </bean>

In this example, the resource property defines the input file. Because the first line of the input file contains headers, you ask Spring Batch to skip this line by setting the property linesToSkip (#1) to 1. You use a DelimitedLineTokenizer (#2) to split each input line into fields; Spring Batch uses a comma as the default separator. Then you define the name of each field. These are the names used in the ProductFieldSetMapper class to retrieve values from the FieldSet. Finally, you inject an instance of ProductFieldSetMapper into the DefaultLineMapper.

That's it; your flat file reader is ready! Don't feel overwhelmed because flat file support in Spring Batch uses many components—that's what makes it powerful and flexible. Next up, to implement the database item writer, you need to do less configuration work but more Java coding. Let's dig in.

Implementing a database item writer

To update the database with product data, you have to implement your own ItemWriter. Each line of the flat file represents either a new product record or an existing one, so you must decide whether to send the database an insert or an update SQL statement. Nevertheless, the implementation of the ProductJdbcItemWriter is straightforward, as shown in the following listing.

Listing 3 Implementing the ProductJdbcItemWriter
package com.manning.sbia.ch01.batch;

import java.util.List;
import javax.sql.DataSource;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;
import com.manning.sbia.ch01.domain.Product;

public class ProductJdbcItemWriter implements ItemWriter<Product> {

  private static final String INSERT_PRODUCT = "insert into product "+   
   "(id,name,description,price) values(?,?,?,?)";

  private static final String UPDATE_PRODUCT = "update product set "+ 
   "name=?, description=?, price=? where id=?";

  private JdbcTemplate jdbcTemplate;

  public ProductJdbcItemWriter(DataSource ds) {
    this.jdbcTemplate = new JdbcTemplate(ds);
  }

  public void write(List<? extends Product> items) throws Exception {
    for (Product item : items) {
      int updated = jdbcTemplate.update(
        UPDATE_PRODUCT,
        item.getName(), item.getDescription(),
        item.getPrice(), item.getId()
      );
      if (updated == 0) {
        jdbcTemplate.update(
          INSERT_PRODUCT,
          item.getId(), item.getName(),
          item.getDescription(), item.getPrice()
        );
      }
    }
  }

}

The ProductJdbcItemWriter uses Spring's JdbcTemplate to interact with the database. The JdbcTemplate is created in the constructor from the DataSource that Spring injects (#1). In the write method, you iterate over a chunk of products and first try to update an existing record (#2). If the database tells you the update statement didn't update any record, you know this record doesn't exist, and you can insert it (#3).

That's it for the implementation! Notice how simple it was to implement this ItemWriter because Spring Batch handles getting records from the ItemReader, creating chunks, managing transactions, and so on. Next, let's configure the database item writer.

Configuring a database item writer

For the item writer to be configured as a Spring bean, it needs a DataSource, as shown in the following XML fragment:


<bean id="writer" 
      class="com.manning.sbia.ch01.batch.ProductJdbcItemWriter">
  <constructor-arg ref="dataSource" />
</bean>

You'll configure the DataSource later, in a separate configuration file. You use a separate file because it decouples the application configuration—the item writer—from the infrastructure configuration—the DataSource. By doing so, you can use the same application configuration across different environments—production and testing, for example—and switch the infrastructure configuration file.

Now that you've created the two parts of the read-write step, you can assemble them in a Spring Batch job.

Configuring the read-write step

Configuring the read-write step is done through Spring. The step configuration can sit next to the declaration of the reader and writer beans, as shown in the following listing.

Listing 4 Spring configuration of the read-write step

<beans xmlns="http://www.springframework.org/schema/beans"    
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"    
    xmlns:batch="http://www.springframework.org/schema/batch"  
    xsi:schemaLocation="http://www.springframework.org/schema/beans 
    http://www.springframework.org/schema/beans/spring-beans-3.0.xsd 
    http://www.springframework.org/schema/batch 
    http://www.springframework.org/schema/batch/spring-batch-2.1.xsd">

  <job id="importProducts"  
xmlns="http://www.springframework.org/schema/batch"> <step id="readWriteProducts"> <tasklet> <chunk reader="reader" writer="writer"
commit-interval="100" />
</tasklet> </step> </job> <bean id="reader" (...) </bean> <bean id="writer" (...) </bean> </beans>

The configuration file starts with the usual declaration of the namespaces and associated prefixes: the Spring namespace and Spring Batch namespace with the batch prefix. The Spring namespace is declared as the default namespace, so you don't need to use a prefix to use its elements. Unfortunately, this is inconvenient for the overall configuration because you must use the batch prefix for the batch job elements. To make the configuration more readable, you can use a workaround in XML: when you start the job configuration XML element (#1), you specify a default XML namespace as an attribute of the job element. The scope of this new default namespace is the job element and its child elements.

The chunk element (#2) configures the chunk-processing step, in a step element, which is itself in a tasklet element. In the chunk element, you refer to the reader and writer beans with the reader and writer attributes. The values of these two attributes are the IDs previously defined in the reader and writer configuration. Finally, at #3 the commit-interval attribute is set to a chunk size of 100.

Choosing a chunk size and commit interval
First, the size of a chunk and the commit interval are the same thing! Second, there's no definitive value to choose. Our recommendation is a value between 10 and 200. Too small a chunk size creates too many transactions, which is costly and makes the job run slowly. Too large a chunk size makes transactional resources—like databases—run slowly too, because a database must be able to roll back operations. The best value for the commit interval depends on many factors: data, processing, nature of the resources, and so on. The commit interval is a parameter in Spring Batch, so don't hesitate to change it to find the most appropriate value for your jobs.

You're done with the copy portion of the batch process. Spring Batch performs a lot of the work for you: it reads the products from the flat file and imports them into the database. You didn't write any code for reading the data. For the write operation, you only created the logic to insert and update products in the database. Putting these components together is straightforward thanks to Spring's lightweight container and the Spring Batch XML vocabulary.

So far, you've implemented the box labeled "Reading and writing" from figure 4. As you've seen, Spring Batch provides a lot of help for this common use case. The framework is even richer and more flexible because a batch process can contain any type of write operation. You'll see an example of this next, when you decompress the input file for your job, as shown in figure 4 in the box labeled "Decompressing."

Decompressing the input file with a tasklet

Remember that the flat file is uploaded to the online store as a compressed archive. You need to decompress this file before starting to read and write products. Decompressing a file isn't a read-write step, but Spring Batch is flexible enough to implement such a task as part of a job. Before showing you how to decompress the input file, let's explain why you must compress the products flat file.

Why compress the file?

The flat file containing the product data is compressed so you can upload it faster from ACME's network to the provider that hosts the online store application. Textual data, as used in the flat file, can be highly compressed, with ratios of 10 to 1 commonly achieved. A 1-GB flat file can compress to 100 MB, which is a more reasonable size for file transfers over the internet.

Note that you could encrypt the file as well, ensuring that no one could read the product data if the file were intercepted during transfer. The encryption could be done before the compression or as part of it. In this case, assume that ACME and the hosting provider agreed on a secure transfer protocol, like Secure Copy (SCP is built on top of Secure Shell [SSH]).

Now that you know why you compress the file, let's see how to implement the decompression tasklet.

Implementing the decompression tasklet

Spring Batch provides an extension point to handle processing in a batch process step: the Tasklet. You implement a Tasklet that decompresses a ZIP archive into its source flat file. The following listing shows the implementation of the DecompressTasklet class.

Listing 5 Implementation of decompression tasklet

package com.manning.sbia.ch01.batch;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.util.zip.ZipInputStream;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.core.io.Resource;

public class DecompressTasklet implements Tasklet {

  private Resource inputResource;
  private String targetDirectory;
  private String targetFile;

  public RepeatStatus execute(StepContribution contribution,
      ChunkContext chunkContext) throws Exception {
    ZipInputStream zis = new ZipInputStream(
      new BufferedInputStream(inputResource.getInputStream()));
    File targetDirectoryAsFile = new File(targetDirectory);
    if (!targetDirectoryAsFile.exists()) {
      FileUtils.forceMkdir(targetDirectoryAsFile);
    }
    File target = new File(targetDirectory, targetFile);
    BufferedOutputStream dest = null;
    while (zis.getNextEntry() != null) {
      if (!target.exists()) {
        target.createNewFile();
      }
      FileOutputStream fos = new FileOutputStream(target);
      dest = new BufferedOutputStream(fos);
      IOUtils.copy(zis, dest);
      dest.flush();
      dest.close();
    }
    zis.close();
    if (!target.exists()) {
      throw new IllegalStateException(
        "Could not decompress anything from the archive!");
    }
    return RepeatStatus.FINISHED;
  }

  /* setters */ (...)

}

The DecompressTasklet class implements the Tasklet interface (#1), which has only one method, called execute. The tasklet has three fields (#2), which represent the archive file, the name of the directory to which the file is decompressed, and the name of the output file. These fields are set when you configure the tasklet with Spring. In the execute method, you open a stream to the archive file (#3), create the target directory if it doesn't exist (#4), and use the Java API to decompress the ZIP archive (#5). Note that the FileUtils and IOUtils classes from the Apache Commons IO project are used to create the target directory and copy the ZIP entry content to the target file (Apache Commons IO provides handy utilities to deal with files and directories). At #6, you return the FINISHED constant from the RepeatStatus enumeration to notify Spring Batch that the tasklet finished.

Only a data file and no metadata file in the ZIP archive?
It's common practice to have two files in a ZIP archive used for a batch job. One file contains the data to import, and the other contains information about the data to import (date, identifier, and so on). We wanted to keep things simple in our Spring Batch introduction, especially the tedious unzipping code, so our ZIP archive contains only a data file. Let's say the name of the unzipped file is made up of meaningful information such as the date and an identifier for the import.

Although the Tasklet interface is straightforward, its implementation includes a lot of code to deal with decompressing the file. Let's now see how to configure this tasklet with Spring.

Configuring the tasklet

The tasklet is configured as part of the job and consists of two changes in Spring: declare the tasklet as a Spring bean and inject it as a step in the job. To do this, you must modify the configuration you wrote for reading and writing products, as shown in the following listing.

Listing 6 Spring configuration of the decompress tasklet

<job id="importProducts" 
     xmlns="http://www.springframework.org/schema/batch">
  <step id="decompress" next="readWriteProducts"> 
<tasklet ref="decompressTasklet" /> </step> <step id="readWriteProducts"> <tasklet> <chunk reader="reader" writer="writer" commit-interval="100" /> </tasklet> </step> </job> <bean id="decompressTasklet"
class="com.manning.sbia.ch01.batch. [CA] DecompressTasklet"> <property name="inputResource" value="file:./input/input.zip" /> <property name="targetDirectory" value="./work/output/" /> <property name="targetFile" value="products.txt" /> </bean>

The configuration of a plain Tasklet step is simpler than that of a read-write step: the tasklet element (#1) only needs to point to the decompressTasklet bean, declared at #2. Note that you control the job flow through the next attribute of the step element, which refers to the readWriteProducts step by ID. If you find that the tasklet bean is configured too rigidly in the Spring file (because the values are hardcoded), don't worry: we'll show you later in this article how to make these settings more dynamic.

You now have all the parts of the job implemented and configured: you can decompress the input archive, read the products from the decompressed flat file, and write them to the database. You're now about to see how to launch the job inside an integration test.

Testing the batch process

Batch applications are like any other applications: you should test them using a framework like JUnit. Testing makes maintenance easier and detects regressions after refactoring. Let's test, then! This section covers how to write an integration test for a Spring Batch job. You'll also learn about the launching API in Spring Batch. But don't be too impatient—we need a couple of intermediary steps before writing the test: configuring a test infrastructure and showing you a trick to make the job configuration more flexible.

Spring Batch and test-driven development
Good news: Spring Batch and test-driven development are fully compatible! We introduce here some techniques to test a Spring Batch job. We truly believe in test-driven development, so we test all the source code with automated tests.

The next section is about setting up the test infrastructure: the ACME job needs a database to write to, and Spring Batch itself needs a couple of infrastructure components to launch jobs and maintain execution metadata. Let's see how to configure a lightweight test infrastructure to launch the test from an IDE.

Setting up the test infrastructure

Spring Batch needs infrastructure components configured in a Spring lightweight container. These infrastructure components act as a lightweight runtime environment to run the batch process. Setting up the batch infrastructure is a mandatory step for a batch application, which you need to do only once for all jobs living in the same Spring application context. The jobs will use the same infrastructure components to run and to store their state. These infrastructure components are the key to managing and monitoring jobs.

Spring Batch needs two infrastructure components:

  • A job repository, which stores the state and execution metadata of jobs
  • A job launcher, which launches jobs

For this test, you use the volatile job repository implementation. It's perfect for testing and prototyping because it stores execution metadata in memory. The following listing shows how to configure the test infrastructure.

Listing 7 Spring configuration for the batch infrastructure

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"    
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
       xmlns:jdbc="http://www.springframework.org/schema/jdbc"     
       xsi:schemaLocation="http://www.springframework.org/schema/jdbc 
       http://www.springframework.org/schema/jdbc/spring-jdbc-3.0.xsd 
       http://www.springframework.org/schema/beans 
       http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">

  <jdbc:embedded-database id="dataSource" type="H2"> 
    <jdbc:script location="/create-tables.sql"/>
  </jdbc:embedded-database>

  <bean id="transactionManager"
        class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <property name="dataSource" ref="dataSource" />
  </bean>

  <bean id="jobRepository"
        class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    <property name="transactionManager" ref="transactionManager" />
  </bean>

  <bean id="jobLauncher"
        class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    <property name="jobRepository" ref="jobRepository" />
  </bean>

  <bean class="org.springframework.jdbc.core.JdbcTemplate">
    <constructor-arg ref="dataSource" />
  </bean>

</beans>

This listing uses an open source in-memory database called H2; although it may look odd for an online application, it's easy to deploy and you won't have to install any database engine to work with the code samples in this article. And remember, this is the testing configuration; the application can use a full-blown, persistent database in production. For a more traditional relational database management system (RDBMS) setup, you could change the data source configuration to use a database like PostgreSQL or Oracle. Listing 7 also runs a SQL script on the database to create the product table and configures a JdbcTemplate to check the state of the database during the test.
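
For example, the test dataSource bean could be swapped, in a production infrastructure file, for something along these lines; the driver, URL, and credentials below are placeholders, and a pooled DataSource would be a better choice than DriverManagerDataSource under real load:

<bean id="dataSource"
      class="org.springframework.jdbc.datasource.DriverManagerDataSource">
  <property name="driverClassName" value="org.postgresql.Driver" />
  <property name="url" value="jdbc:postgresql://localhost:5432/acmestore" />
  <property name="username" value="store" />
  <property name="password" value="secret" />
</bean>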

How does a job refer to the job repository?
You may have noticed that we say a job needs the job repository to run, but we don't make any reference to the job repository bean in the job configuration. The job element can have its job-repository attribute refer to a job repository bean. This attribute isn't mandatory, because by default the job uses a bean named jobRepository. As long as you declare a jobRepository bean of type JobRepository, you don't need to refer to it explicitly in your job configuration.

This leads us to the following best practice: when configuring a Spring Batch application, the infrastructure and job configuration should be in separate files.

SPLITTING INFRASTRUCTURE AND APPLICATION CONFIGURATION FILES You should always split infrastructure and application configuration files (test-context.xml and import-products-job-context.xml in our example). This allows you to swap out the infrastructure for different environments (test, development, staging, production) and still reuse the application (job, in our case) configuration files.

In a split application configuration, the infrastructure configuration file defines the job repository and data source beans; the job configuration file defines the job and depends on the job repository and data source beans. For Spring to resolve the whole configuration properly, you must bootstrap the application context from both files.
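
The integration test shown later does this through the Spring TestContext Framework; done by hand, it might look like this sketch (the file names are the ones used in this article, and the BootstrapSketch class is ours):

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class BootstrapSketch {

  public static void main(String[] args) {
    // loads the job configuration and the infrastructure configuration together
    ApplicationContext context = new ClassPathXmlApplicationContext(
      "import-products-job-context.xml", "test-context.xml");
  }
}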

You completed the infrastructure and job configuration in a flexible manner by splitting the configuration into an infrastructure file and a job file. Next, you make the configuration more flexible by leveraging the Spring Expression Language (SpEL) to avoid hardcoding certain settings in Spring configuration files.

Leveraging SpEL for configuration

Remember that part of your job configuration is hardcoded in the Spring configuration files, such as the file location settings:


<bean id="decompressTasklet" 
      class="com.manning.sbia.ch01.batch.DecompressTasklet">
  <property name="inputResource" value="file:./input/input.zip" />
  <property name="targetDirectory" value="./work/output/" />
  <property name="targetFile" value="products.txt" />
</bean>

These settings aren't flexible because they can change between environments (testing and production, for example) and because rolling files might be used for the incoming archive (meaning the filename would depend on the date). An improvement is to turn these settings into parameters specified at launch time. When launching a Spring Batch job, you can provide parameters, as in the following:


jobLauncher.run(job, new JobParametersBuilder()
  .addString("parameter1", "value1")
  .addString("parameter2", "value2")
  .toJobParameters());

The good news is that you can refer to these parameters in your job configuration, which comes in handy for the DecompressTasklet and FlatFileItemReader beans, as shown in the following listing.

Listing 8 Referring to job parameters in the Spring configuration

<bean id="decompressTasklet" 
      class="com.manning.sbia.ch01.batch.DecompressTasklet" 
      scope="step"> 
<property name="inputResource"
value="#{jobParameters['inputResource']}" /> <property name="targetDirectory" value="#{jobParameters['targetDirectory']}" /> <property name="targetFile" value="#{jobParameters['targetFile']}" /> </bean> <bean id="reader" class="org.springframework.batch.item.file.FlatFileItemReader" scope="step">
<property name="resource" value="file:#{jobParameters['targetDirectory'] [CA] +jobParameters['targetFile']}" />

To be able to refer to job parameters, a bean must use the Spring Batch step scope (#1). The step scope means that Spring will create the bean only when the step asks for it and will resolve the values then (this is the lazy instantiation pattern; the bean isn't created during the Spring application context's bootstrapping). To trigger the dynamic evaluation of a value, you must use the #{expression} syntax. The expression must be in SpEL, which is available as of Spring 3.0 (Spring Batch falls back to a less powerful language if you don't have Spring 3.0 on your class path). The jobParameters variable behaves like a Map. That's how you refer to the inputResource, targetDirectory, and targetFile job parameters (#2). Note that you're not limited to plain references; you can also use more complex expressions. For example, notice how the target directory and file are concatenated for the resource property.

You're done with the configuration: the job and infrastructure are ready, and part of the configuration can come from job parameters, which are set when you launch the job. It's time to write the test for your batch process.

Writing the test for the job

You use good old JUnit to write the test, with some help from the Spring testing support. The following listing shows the integration test for the job.

Listing 9 Integration test for the import product test

package com.manning.sbia.ch01.batch; 

(...)

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations={ 
  "/import-products-job-context.xml", 
  "/test-context.xml"
})
public class ImportProductsIntegrationTest {

  @Autowired
  private JobLauncher jobLauncher;

  @Autowired 
  private Job job;

  @Autowired
  private JdbcTemplate jdbcTemplate;

  @Before
  public void setUp() throws Exception { 
    jdbcTemplate.update("delete from product"); 
jdbcTemplate.update("insert into product "+ "(id,name,description,price) values(?,?,?,?)", "PR....214","Nokia 2610 Phone","",102.23 ); } @Test public void importProducts() throws Exception { int initial = jdbcTemplate.queryForInt("select count(1) from product"); jobLauncher.run(
job, new JobParametersBuilder() .addString("inputResource", "classpath:/input/products.zip") .addString("targetDirectory", "./target/importproductsbatch/") .addString("targetFile", "products.txt") .addLong("timestamp", System.currentTimeMillis()) .toJobParameters() ); int nbOfNewProducts = 7; Assert.assertEquals(
initial+nbOfNewProducts, jdbcTemplate.queryForInt( "select count(1) from product") ); } }

The test uses the Spring TestContext Framework, which creates a Spring application context during the test and lets you inject Spring beans into the test (with the @Autowired annotation). The @RunWith and @ContextConfiguration annotations trigger the Spring TestContext Framework. At #1, you clean and populate the database. This creates a consistent database environment for each @Test method. At #2, you launch the job with its parameters and check at #3 that the job correctly inserted the products from the test ZIP archive. The test ZIP archive doesn't have to contain thousands of records: it can be small so the test runs quickly.

You can now run the test with your favorite IDE (Eclipse, IDEA) or build tool (Maven, Ant). Figure 8 shows the result of the test execution in Eclipse.

Figure 8 Launching the test in Eclipse. Despite all its features, Spring Batch remains lightweight, making jobs easy to test.

That's it! You have a reliable integration test for your batch job. Wasn't it easy? Even if the job handles hundreds of thousands of records daily, you can test in an IDE in a couple of seconds.

NOTE A common requirement is launching jobs from the command line.
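
Spring Batch ships with a CommandLineJobRunner class for this purpose. A sketch of an invocation for our job might look like the following, assuming the configuration file also declares (or imports) the infrastructure beans and that the classpath entries point to the application and its dependencies:

java -cp "classes:lib/*" \
  org.springframework.batch.core.launch.support.CommandLineJobRunner \
  import-products-job-context.xml importProducts \
  inputResource=classpath:/input/products.zip \
  targetDirectory=./target/importproductsbatch/ \
  targetFile=products.txt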

The job works, great, but batch applications aren't common pieces of software: they must be bulletproof. What happens if the input file contains a badly formatted line? Could you live with your job crashing because of an extra comma? The next section covers how Spring Batch lets you skip incorrect lines instead of failing.

Skipping incorrect lines instead of failing

We listed the requirements for batch applications, including robustness. The import product job isn't robust yet: for example, it crashes abruptly if only a single line of the flat file is formatted incorrectly. The good news is that Spring Batch can help make the job more robust by changing the configuration or by implementing simple interfaces.

We will show you how to handle unexpected entries when you're reading data. By the end of this section, the import product job will be more robust and you'll have a better understanding of how Spring Batch can help improve robustness in general.

On a good day, the import product job will decompress the input archive, read each line of the extracted flat file, send data to the database, and then exit successfully. As you know, if something can go wrong, it will. For instance, if the FlatFileItemReader fails to read a single line of the flat file—because it's incorrectly formatted, for example—the job immediately stops. Perhaps this is acceptable behavior, but what if you can live with some invalid records? In this case, you could skip an invalid line and keep on chugging. Spring Batch allows you to choose declaratively a skip policy when something goes wrong. Let's apply a skip policy to your job's import step.

Suppose a line of the flat file hasn't been generated correctly, like the price of the third product in the following snippet:


PRODUCT_ID,NAME,DESCRIPTION,PRICE
PR....210,BlackBerry 8100 Pearl,,124.60
PR....211,Sony Ericsson W810i,,139.45
PR....212,Samsung MM-A900M Ace,,97,80
PR....213,Toshiba M285-E 14,,166.20

The format of the price field of the third record is incorrect: it uses a comma instead of a period as the decimal separator. Note that the comma is also the field separator Spring Batch uses to tokenize input lines: the framework would see five fields where it expects only four. The FlatFileItemReader throws a FlatFileParseException and, in the default configuration, Spring Batch immediately stops the process.

Assuming you can live with skipping some records instead of failing the whole job, you can change the job configuration to keep on reading when the reader throws a FlatFileParseException, as shown in the following listing.

Listing 10 Setting the skip policy when reading records from the flat file

<job id="importProducts" 
     xmlns="http://www.springframework.org/schema/batch">
  <step id="decompress" next="readWriteProducts">
    <tasklet ref="decompressTasklet" />
  </step>
  <step id="readWriteProducts">
    <tasklet>
      <chunk reader="reader" writer="writer" commit-interval="100" 
          skip-limit="5"> 
        <skippable-exception-classes>
          <include class="org.springframework.batch.item.file.FlatFileParseException" />
        </skippable-exception-classes>
      </chunk>
    </tasklet>
  </step>
</job>

The skip policy is set in the chunk element. The skip-limit attribute (#1) tells Spring Batch to fail the job as soon as the number of skipped records in the step exceeds this limit. Your application can be tolerant, but not too tolerant! Then, the exception classes that trigger a skip are stated (#2). Here, we want to skip the offending line when the item reader throws a FlatFileParseException.

You can now launch the job with an input file containing incorrectly formatted lines, and you'll see that Spring Batch keeps on running the job as long as the number of skipped items doesn't exceed the skip limit. Assuming the ZIP archive contains incorrect lines, you can add a test method to your test, as shown in the following listing.

Listing 11 Testing the job correctly skips incorrect lines with a new test method

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations={"/import-products-job-context.xml","/testcontext.xml"})
public class ImportProductsIntegrationTest { 
  (...)
  @Test
  public void importProductsWithErrors() throws Exception {
    int initial = jdbcTemplate.queryForInt("select count(1) from product");

    jobLauncher.run(job, new JobParametersBuilder()
      .addString("inputResource", 
        "classpath:/input/products_with_errors.zip")
      .addString("targetDirectory", "./target/importproductsbatch/")
      .addString("targetFile","products.txt")
      .addLong("timestamp", System.currentTimeMillis())
      .toJobParameters()
    );
    int nbOfNewProducts = 6; 
    Assert.assertEquals(
      initial+nbOfNewProducts,
      jdbcTemplate.queryForInt("select count(1) from product")
    );
  }

}

Note that this code doesn't do any processing when something goes wrong, but you could choose to log that a line was incorrectly formatted. Spring Batch also provides hooks to handle errors.
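
One such hook is the SkipListener interface. As a sketch, a listener that logs skipped lines could look like the following; the class name is ours, and wiring it into the step (through the step's listeners element) isn't shown here:

package com.manning.sbia.ch01.batch;

import org.springframework.batch.core.SkipListener;

import com.manning.sbia.ch01.domain.Product;

public class SkipLoggingListener implements SkipListener<Product, Product> {

  public void onSkipInRead(Throwable t) {
    // called once for each line the reader skips, e.g. on a FlatFileParseException
    System.err.println("Skipped an unparseable line: " + t.getMessage());
  }

  public void onSkipInProcess(Product item, Throwable t) {
    // no item processor in this job, so nothing to do here
  }

  public void onSkipInWrite(Product item, Throwable t) {
    // could log items the writer failed to write
  }
}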

This completes the bulletproofing of the product import job. The job executes quickly and efficiently, and it's more robust and reacts appropriately to unexpected events such as invalid input records.

Summary

That's it—you've implemented a full-blown job with Spring Batch! This shows how Spring Batch provides a powerful framework to create batch jobs and handles the heavy lifting like file I/O. Your main tasks were to write a couple of Java classes and do some XML configuration. That's the philosophy: focus on the business logic and Spring Batch handles the rest.

Here are some other Manning titles you might be interested in:


  • Spring in Action, Third Edition by Craig Walls
  • Spring in Practice by Willie Wheeler, John Wheeler, and Joshua White
  • Spring Integration in Action by Mark Fisher, Jonas Partner, Marius Bogoevici, and Iwein Fuld