The FIR calculation is so common in DSP that special architectures are used to optimise its efficiency. Indeed, some DSPs incorporate co-processors for filter calculations.
Virtually every serious DSP incorporates a specific mechanism for processing FIR filter taps. The example given here is specifically for the Analog Devices SHARC series of 32-bit floating-point processors, which use the Harvard architecture. Essentially, four separate operations related to a FIR tap calculation are carried out in a single clock cycle. The first two operations are a multiply and an add; the other two retrieve two pieces of data from memory in preparation for the next multiply and add.
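As a point of reference, the per-tap multiply and add can be sketched in plain Python. This is only an illustrative model of the arithmetic, not SHARC code; the function name `fir_output` and the sample values are assumptions made for this example.

```python
def fir_output(samples, coeffs):
    """Return one FIR output value: the sum of samples[k] * coeffs[k] over all taps."""
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    acc = 0.0
    for x, c in zip(samples, coeffs):
        acc += x * c  # one multiply and one add per tap
    return acc


# Illustrative data only (values chosen to be exact in binary floating point).
samples = [1.0, 2.0, 3.0, 4.0]
coeffs = [0.5, 0.25, 0.125, 0.0625]
result = fir_output(samples, coeffs)
print(result)
```

Written this way, each tap costs a multiply, an add and two memory reads in sequence; the DSP instruction described here performs all four in one clock cycle.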
FIR: is the loop label. A loop must be set up so that the algorithm is repeated for each FIR filter tap. Before the loop, the parameters and starting addresses need to be set up. After the loop, a clean-up of pending calculations may be required.
The sample data is held in f0 and the FIR filter coefficient data in f4. f0 and f4 are register variables. In order to fetch these two pieces of data simultaneously, two different areas of memory are used. In the Harvard architecture the memory is divided into data memory and program memory, with the program memory able to store data as well as the program. This allows simultaneous fetching of data and a program instruction, or of two separate pieces of data. Here dm(..) fetches from the data memory and pm(..) from the program memory. f0=dm(i0,m0) means fetch the data at the data memory address i0, place it in f0 and then increment the address i0 by m0. The address and its increment are both held in registers. Likewise, f4=pm(i9,m9) means fetch the data at the program memory address i9, place it in f4 and then increment i9 by m9.
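The read-then-increment behaviour of dm(i0,m0) and pm(i9,m9) can be mimicked with a small Python model. This is a sketch under stated assumptions: the class `MemoryBank` and its method `read_post_modify` are hypothetical names invented for illustration, and addresses are modelled as simple list indices.

```python
class MemoryBank:
    """Hypothetical model of one memory bank with index (i) and modify (m) registers."""

    def __init__(self, words):
        self.words = list(words)
        self.i = {}  # index registers: name -> current address
        self.m = {}  # modify registers: name -> address increment

    def read_post_modify(self, i_name, m_name):
        """Read the word at address i, then advance i by m."""
        value = self.words[self.i[i_name]]
        self.i[i_name] += self.m[m_name]
        return value


# Samples live in data memory, coefficients in program memory (illustrative values).
dm = MemoryBank([10.0, 20.0, 30.0])
pm = MemoryBank([0.5, 0.25, 0.125])
dm.i["i0"], dm.m["m0"] = 0, 1
pm.i["i9"], pm.m["m9"] = 0, 1

f0 = dm.read_post_modify("i0", "m0")  # models f0=dm(i0,m0)
f4 = pm.read_post_modify("i9", "m9")  # models f4=pm(i9,m9)
print(f0, f4, dm.i["i0"], pm.i["i9"])
```

Because the two banks are separate objects, nothing in the model prevents both reads from happening in the same step, which is the property the two-memory arrangement provides in hardware.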
Note: f0 and f4 are in different register sets to avoid a register access clash. Before starting the loop, the registers f0 and f4 have to be primed with their initial values. The output accumulator f8 and the multiplication accumulator f12 are initialised to 0.
Thus, in each loop cycle, the two pieces of data f0 and f4 are multiplied to generate the latest value of the multiplication accumulator f12. At the same time, the f12 value from the previous tap (loop cycle) is added to the output accumulator f8.
Finally, on termination of the loop, f0 and f4 hold the last pieces of data, with a multiplication and the final additions still pending. This post-processing is either done outside the loop, or the loop is allowed to roll on for two more cycles by ensuring that the next two data elements in dm and pm are harmless, i.e. by padding the sample and coefficient data arrays with two zero-valued elements.
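The priming, the one-cycle lag between multiply and add, and the two ways of finishing can be sketched in Python. This is a behavioural model only, assuming that all four operations in a cycle read the register values from before that cycle and that the main loop runs once fewer than the number of taps; the function names are hypothetical.

```python
def run_cycles(xs, cs, cycles):
    """Prime f0/f4, zero f8/f12, then run the four-operation cycle `cycles` times."""
    i0 = i9 = 0
    f0 = xs[i0]; i0 += 1   # prime the sample register
    f4 = cs[i9]; i9 += 1   # prime the coefficient register
    f8 = f12 = 0.0         # output accumulator and multiplication accumulator
    for _ in range(cycles):
        # The right-hand side is evaluated in full before any assignment,
        # which models four operations sharing one clock cycle.
        f12, f8, f0, f4 = f0 * f4, f8 + f12, xs[i0], cs[i9]
        i0 += 1
        i9 += 1
    return f0, f4, f8, f12


def fir_with_epilogue(samples, coeffs):
    """Run n - 1 loop cycles, then finish the pending work outside the loop."""
    f0, f4, f8, f12 = run_cycles(samples, coeffs, len(coeffs) - 1)
    f8 = f8 + f12   # add the product from the last loop cycle
    f12 = f0 * f4   # pending multiplication of the last data pair
    return f8 + f12  # final addition


def fir_with_padding(samples, coeffs):
    """Pad both arrays with two zeros and let the loop roll on two more cycles."""
    xs = list(samples) + [0.0, 0.0]
    cs = list(coeffs) + [0.0, 0.0]
    _, _, f8, _ = run_cycles(xs, cs, len(coeffs) + 1)
    return f8


samples = [1.0, 2.0, 3.0, 4.0]
coeffs = [0.5, 0.25, 0.125, 0.0625]
print(fir_with_epilogue(samples, coeffs), fir_with_padding(samples, coeffs))
```

In this model the padded variant trades two extra memory words per array and two extra cycles for a loop with no special-case code after it.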