||This article includes a list of references, but its sources remain unclear because it has insufficient inline citations. Please help to improve this article by introducing more precise citations.|
Batch processing is the execution of a series of jobs in a program on a computer without manual intervention (non-interactive). Strictly speaking, it is a processing mode: the execution of a series of programs each on a set or "batch" of inputs, rather than a single input (which would instead be a custom job). However, this distinction has largely been lost, and the series of steps in a batch process are often called a "job" or "batch job".
Batch processing has these benefits:
- It can shift the time of job processing to when the computing resources are less busy.
- It avoids idling the computing resources with minute-by-minute manual intervention and supervision.
- By keeping high overall rate of utilization, it amortizes the computer, especially an expensive one.
- It allows the system to use different priorities for interactive and non-interactive work.
- Rather than running one program multiple times to process one transaction each time, batch processes will run the program only once for many transactions, reducing system overhead.
It also has multiple disadvantages, for instance users are unable to terminate a process during execution, and have to wait until execution completes.
The term "batch processing" originates in the traditional classification of methods of production as job production (one-off production), batch production (production of a "batch" of multiple items at once, one stage at a time), and flow production (mass production, all stages in process at once).
Early history (19th century through 1960s)
Batch processing dates to the late 19th century, in the processing of data stored on decks of punch card by unit record equipment, specifically the tabulating machine by Herman Hollerith, used for the 1890 United States Census. This was the earliest use of a machine-readable medium for data, rather than for control (as in Jacquard looms; today control corresponds to code), and thus the earliest processing of machine-read data was batch processing. Each card stored a separate record of data with different fields: cards were processed by the machine one by one, all in the same way, as a batch. Batch processing continued to be the dominant processing mode on mainframe computers from the earliest days of electronic computing in the 1950s.
Originally machines only tabulated data, counting records with certain properties, like "male" or "female". In later use, separate stages or "cycles" of processing could be done, analogous to the stages in batch production. In modern data processing terms, one can think of each stage as an SQL clause, such as SELECT (filter columns), then WHERE (filter cards, or "rows"), etc. The earliest machines were built (hard-wired) for a single function, while from 1906 they could be rewired via plugboards, and electronic computers could be reprogrammed without being rewired. Thus early multi-stage processing required separate machines for each stage, or rewiring a single machine after each stage. Early electronic computers were not capable of having multiple programs loaded into main memory (multiprogramming), and thus while they could process multiple stages on a single machine without rewiring, the program for each stage had to be loaded into memory, run over the entire batch, and then the program for the next loaded and run.
There were a variety of reasons why batch processing dominated early computing. One reason is that the most urgent business problems for reasons of profitability and competitiveness were primarily accounting problems, such as billing or payroll; this priority of accounting in early use of information technology is ancient: see history of writing and history of accounting. Billing may conveniently be performed as a batch-oriented business process, and practically every business must bill, reliably and on-time. While accounting is time-sensitive, it can be done daily (particularly overnight, after close of business), and does not require interaction. Also, every computing resource was expensive, so sequential submission of batches on punched cards matched the resource constraints and technology evolution at the time.
Batch data processing took advantage of the economies of scale in sorting and processing sequential data storage media, such as punch cards and, later, magnetic tape. Typically transactions for a recording period, such as a day or a week, would be entered onto cards from paper forms using a keypunch machine. At the close of the period, the data would be sorted using a card sorting machine, or, later a computer. The sorted data would then be used to update a master file, such as an accounting ledger or inventory file, that was kept sorted by the same key. Only one pass through the sequential files would be needed for the updates. Reports and other outputs, such as bills and payment checks, would then be generated from the master file.
In addition to batches of data, early electronic computers could also run one-off computations, notably compilation (see history of compiler construction). These were accordingly referred to as jobs, as they were one-off processing, and were controlled by languages such as Job Control Language (JCL). However, this distinction between jobs and batches later became blurred with the advent of interactive computing.
Later history (1960s onwards)
From the late 1960s onwards, interactive computing such as via text-based computer terminal interfaces (as in Unix shells or read-eval-print loops), and later graphical user interfaces became common. Non-interactive computation, both one-off jobs such as compilation, and processing of multiple items in batches, became retrospectively referred to as batch processing, and the oxymoronic term batch job (in early use often "batch of jobs") became common. Early use is particularly found at the University of Michigan, around the Michigan Terminal System (MTS); examples from 1968 and 1969:
Non-interactive computation remains pervasive in computing, both for general data processing and for system "housekeeping" tasks (using system software). A high-level program (executing multiple programs, with some additional "glue" logic) is today most often called a script, and written in scripting languages, particularly shell scripts for system tasks; however, in DOS this is instead known as a batch file. That includes UNIX-based computers, Microsoft Windows, Mac OS X (whose foundation is the BSD Unix kernel), and even smartphones. A running script, particularly one executed from an interactive login session, is often known as a job, but that term is used very ambiguously.
Batch processing narrowly speaking (processing multiple records through stage, one stage at a time) is still pervasive in mainframe computing, but is less common in interactive online networked systems, particularly in client–server systems such as the request–response messages of web servers. These systems instead function as flow processing, where for each task messages are passed between servers, all servers working at once on different stages of different tasks. Even in non-networked settings, flow processing is common, specifically as pipelines of connected processes, concurrently processing like an assembly line.
Where batch processing remains in use, the outputs of separate stages (and input for the subsequent stage) are typically stored as files. This is often used for ease of development and debugging, as it allows intermediate data to be reused or inspected. For example, to process data using two program
step2, one might get initial data from a file
input, and store the ultimate result in a file
output. Via batch processing, one can use an intermediate file,
intermediate, and run the steps in sequence (Unix syntax):
step1 < input > intermediate step2 < intermediate > output
This batch processing can be replaced with a flow: the intermediary file can be elided with a pipe, feeding output from one step to the next as it becomes available:
step1 < input | step2 > output
Batch applications are still critical in most organizations in large part because many common business processes are amenable to batch processing. While online systems can also function when manual intervention is not desired, they are not typically optimized to perform high-volume, repetitive tasks. Therefore, even new systems usually contain one or more batch applications for updating information at the end of the day, generating reports, printing documents, and other non-interactive tasks that must complete reliably within certain business deadlines.
Some applications are amenable to flow processing, namely those that only need data from a single input at once (not totals, for instance): start the next step for each input as it completes the previous step. In this case flow processing lowers latency for individual inputs, allowing them to be completed without waiting for the entire batch to finish. However, many applications require data from all records, notably computations such as totals. In this case the entire batch must be completed before one has a usable result: partial results are not usable.
Modern batch applications make use of modern batch frameworks such as Jem The Bee, Spring Batch or implementations of JSR 352 written for Java, and other frameworks for other programming languages, to provide the fault tolerance and scalability required for high-volume processing. In order to ensure high-speed processing, batch applications are often integrated with grid computing solutions to partition a batch job over a large number of processors, although there are significant programming challenges in doing so. High volume batch processing places particularly heavy demands on system and application architectures as well. Architectures that feature strong input/output performance and vertical scalability, including modern mainframe computers, tend to provide better batch performance than alternatives.
Scripting languages became popular as they evolved along with batch processing.
A batch window is "a period of less-intensive online activity", when the computer system is able to run batch jobs without interference from online systems.
Many early computer systems offered only batch processing, so jobs could be run any time within a 24-hour day. With the advent of transaction processing the online applications might only be required from 9:00 a.m. to 5:00 p.m., leaving two shifts available for batch work, in this case the batch window would be sixteen hours. The problem is not usually that the computer system is incapable of supporting concurrent online and batch work, but that the batch systems usually require access to data in a consistent state, free from online updates until the batch processing is complete.
In a bank, for example, so-called end-of-day (EOD) jobs include interest calculation, generation of reports and data sets to other systems, printing statements, and payment processing. This coincides with the concept of Cutover, where transaction and data are cut off for a particular day's batch activity and any further data is contributed to the next following day's batch activity (this is the reason for messages like "deposits after 3 PM will be processed the next day").
The batch window is further complicated by the actual run-time of a particular batch activity. Some batches in banking can take between 5-9 hours of run time, coupled with global constraints some batch activity is broken up or even stalled to allow periodic use of databases mid batch (usually in read-only) to support automated testing scripts that may run in the evening hours or outsourced\contract testing and development resources abroad. More complex problems arise when institutions both have batch activities that may be dependent meaning both batches have to complete in the same batch window.
As requirements for online systems uptime expanded to support globalization, the Internet, and other business requirements the batch window shrank and increasing emphasis was placed on techniques that would require online data to be available for a maximum amount of time.
Common batch processing usage
Batch processing is also used for efficient bulk database updates and automated transaction processing, as contrasted to interactive online transaction processing (OLTP) applications. The extract, transform, load (ETL) step in populating data warehouses is inherently a batch process in most implementations.
Batch processing is often used to perform various operations with digital images such as resize, convert, watermark, or otherwise edit image files.
Batch processing may also be used for converting computer files from one format to another. For example, a batch job may convert proprietary and legacy files to common standard formats for end-user queries and display.
Notable batch scheduling and execution environments
The Unix programs
batch is a variant of
at) allow for complex scheduling of jobs. Windows has a job scheduler. Most high-performance computing clusters use batch processing to maximize cluster usage.
The IBM mainframe z/OS operating system or platform has arguably the most highly refined and evolved set of batch processing facilities owing to its origins, long history, and continuing evolution. Today such systems commonly support hundreds or even thousands of concurrent online and batch tasks within a single operating system image. Technologies that aid concurrent batch and online processing include Job Control Language (JCL), scripting languages such as REXX, Job Entry Subsystem (JES2 and JES3), Workload Manager (WLM), Automatic Restart Manager (ARM), Resource Recovery Services (RRS), DB2 data sharing, Parallel Sysplex, unique performance optimizations such as HiperDispatch, I/O channel architecture, and several others.