on Wednesday 20 July 2011

How to get a fast 2x2-FLI routine

by Wolfram Sang (Ninja/The Dreams - www.the-dreams.de)

Shortly before X2004, Oswald/Resource asked me if I could do a 2x2-FLI
routine that is fast enough to leave some extra cycles for the main
routine while displaying the FLI. That was an interesting problem, so
maybe the things I came up with will be inspiring for you, too. To see
the routine in action, have a look at the demo "REAL" by Resource and
The Dreams. If it wasn't for such a routine, probably no one would have
dared to do a tunnel and the julia-effect in such a resolution, because
it would have been awfully slow. Okay, now enjoy the article!


First of all, be aware of the terms "interrupt", "IRQ" and "NMI". When I
say "interrupt", I mean interrupts in general. When I say "IRQ" or "NMI",
I mean that specific interrupt. We need to be strict about this.
Furthermore, it will be helpful if you know (at least in theory) how to
get a stable raster using a VIC-IRQ and a timer. In general, you should
not be afraid of timers.

The task

We need to do FLI on every 3rd, 5th and 7th line of every charline
(counting from 1). The 1st line is handled by the VIC automatically, of
course. Doing this full-screen, we end up with 25 charlines * 3 = 75
interrupts per frame altogether! That means the interrupt doing the FLI
must be as fast as possible. Every cycle saved here gives us 75
additional cycles per frame for the main routine. So what can be done?
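
The count of 75 is simple arithmetic. As a minimal sketch in Python (not
part of the original routine; all names are made up for illustration, and
a full-height screen of 25 charlines is assumed):

```python
# Illustrative arithmetic only; the names are hypothetical and do not
# appear in the original source-code.
CHARLINES = 25          # assumption: full-height screen of 25 charlines
FLI_LINES = (3, 5, 7)   # lines of a charline that need an interrupt
                        # (counting from 1; line 1 is done by the VIC)

def interrupts_per_frame(charlines=CHARLINES, fli_lines=FLI_LINES):
    """Number of FLI interrupts needed for one frame."""
    return charlines * len(fli_lines)

if __name__ == "__main__":
    print(interrupts_per_frame())
```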

Double Timer

Of course, for doing FLI we need a stable raster. The fastest way of
getting one is the double timer method. Look at 4x4-routines which
use the VIC-IRQ. As this interrupt always occurs at the beginning
of a rasterline, they usually need some NOPs to throw away cycles until
the correct position for FLI is reached. Horrible! Using a double
timer, we can set the beginning of the interrupt anywhere we want. Okay,
measuring the correct position in the init-routine can be nasty at
first, but keep at it, the result pays off. Here, of course, we set
the interrupt so that we reach the FLI-position "just in time". Keep
in mind that you always have to use a CIA-detection routine when using
timers for stable rasters. New CIAs initiate an IRQ one cycle earlier,
and not taking care of this can lead to crashes. You can find an example
in my source-code, if needed.

Using NMI

To get the desired three interrupts per charline, we can use two timers.
One fires every 4th rasterline, doing FLI on the 3rd and 7th line of a
charline. The other timer fires every 8th rasterline, doing FLI on the
5th line of a charline. If we now use the timers of CIA2, which trigger
the NMI, we have an elegant way to keep boundary checks out of our
FLI-routine. Again, look at some 4x4-routines: they often check inside
the FLI-IRQ whether a certain rasterline has been reached, where
displaying FLI has to stop. We can do it differently now: We use VIC-IRQs
to allow/forbid the timers of CIA2 to trigger NMIs, which amounts to
starting/stopping the FLI display. So, the boundary checks are within two
tiny VIC-IRQs instead of the NMIs that are called 75 times per frame.
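
To make the two timer periods concrete, here is a small Python sketch of
the arithmetic. It is only an illustration under stated assumptions: a
PAL machine with 63 cycles per rasterline, and a continuously running CIA
timer whose period is its latch value plus one. The function names are
hypothetical and not taken from the source-code.

```python
# Illustrative model only; names are hypothetical.
CYCLES_PER_LINE = 63   # assumption: PAL timing, 63 cycles per rasterline

def latch_for_lines(lines, cycles_per_line=CYCLES_PER_LINE):
    """Latch value for a continuous CIA timer that should underflow once
    every `lines` rasterlines (assuming period = latch + 1 cycles)."""
    return lines * cycles_per_line - 1

def lines_hit(first_line, every, lines_per_charline=8):
    """Lines of one charline (counting from 1) on which a timer fires,
    if it first fires on `first_line` and then every `every` lines."""
    return list(range(first_line, lines_per_charline + 1, every))

if __name__ == "__main__":
    print(hex(latch_for_lines(4)), lines_hit(3, 4))  # timer for lines 3 and 7
    print(hex(latch_for_lines(8)), lines_hit(5, 8))  # timer for line 5
```

Together the two timers cover exactly the 3rd, 5th and 7th line of each
charline, which is why no third timer is needed.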

Making stable

Okay, now we already gained some cycles between two NMIs, but we need
more. What is left to optimize? The routine to get a stable raster. You
probably know routines like this:

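eor #$0f          ; invert the low 4 bits of the value in A
sta self_mod+1    ; patch the operand of the branch below
self_mod:
bpl *             ; branch offset is overwritten by the STA above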

They take around 22 cycles at maximum to get a stable raster, so there
is a lot to gain here. We have to pay a price, though. Usually, if you
want a *very* fast routine, you need to sacrifice memory (and vice
versa). So, for a significant speed-up we need 8 pages of memory. How is
this achieved? We will use another timer to tell us the position within
a rasterline. The trick is now to use the value of this timer as part
of a jump-instruction, so for every value of this timer, there will be
an appropriate interrupt-routine.

In detail, the timer at $DC06/7 runs from $003e down to $0000
(rasterline-x-position). The timer at $DC04 does not run and just
serves as simple memory. It stores the jump-opcode $4C and the low-byte
of the desired NMI-routine. Now we set the NMI-vector to $DC04, so the
value in $DC06 becomes the high-byte of the JMP, and depending on that
value one of the FLI-routines is used.
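
How the three bytes at $DC04-$DC06 form a jump can be modelled in a few
lines of Python. This is only a sketch of the address arithmetic, not the
routine itself; the function names are hypothetical, and the base page
$08 is taken from the $0800-$0FFF layout of my ready-made version.

```python
# Illustrative model only; names are hypothetical.
JMP_ABS = 0x4C   # 6502 opcode for JMP with an absolute address

def nmi_target(dc05_low, dc06_timer):
    """Address reached by executing the bytes at $DC04-$DC06:
    $DC04 = $4C, $DC05 = low byte, $DC06 = timer value = high byte."""
    return ((dc06_timer & 0xFF) << 8) | (dc05_low & 0xFF)

def handler_pages(base_page=0x08, jitter_values=8):
    """High bytes (pages) that need an NMI-routine, one per jitter value."""
    return [base_page + i for i in range(jitter_values)]

if __name__ == "__main__":
    for page in handler_pages():
        print(hex(nmi_target(0x40, page)))   # $40 is an arbitrary low byte
```

The same low byte therefore selects "the same" routine in every page,
while the timer picks the page that matches the current jitter.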

As the jitter can be 8 cycles, we need to have 8 pages of NMI-routines.
We need to have 3 different interrupts per charline, so every page must
have 3 appropriate NMIs to make FLI in the desired rasterline. That
leads to 8*3=24 NMI routines in those 8 pages. When using interlace
inside the same VIC-bank, this value doubles to 48! So, it will get
quite messy in that memory, but it is fast.

As every jitter-value gets its own NMI-routine, we have another
small bonus. We don't have to use NOPs to clean the jitter, we can use
"sensible" opcodes. For example, if the jitter is at least 4 cycles, you
could acknowledge the NMI by using BIT $DD0D before the FLI takes place.
If the jitter is below 4 cycles, you simply do it after the FLI. In both
cases, no cycles are wasted.

As a result, this version to make a stable raster just needs 6 cycles in
the worst case (3 for the JMP and 3 to clean the maximum jitter).
Comparing this to the 22-cycle-routine before, this is a gain of 16
cycles * 75 interrupts/frame = another 1200 cycles/frame. Yeah!
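
The gain can be checked with a short calculation. The Python sketch below
is illustrative only; the function name is made up, and the figures of 22
and 6 cycles and 75 interrupts per frame are the ones quoted in this
article.

```python
# Illustrative arithmetic only; the name is hypothetical.
def cycles_gained_per_frame(old_cycles=22, new_cycles=6, interrupts=75):
    """Cycles per frame saved by the faster stabilising method."""
    return (old_cycles - new_cycles) * interrupts

if __name__ == "__main__":
    print(cycles_gained_per_frame())
```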

The outcome

For the best case, the FLI-routine now looks as simple as this:

DC04: JMP $xxxx

sta nmi_a ; save accu
lda #d018val
sta $d018
lda #d011val
sta $d011 ; do FLI
bit $dd0d ; acknowledge NMI
lda #next_nmi ; low byte of next NMI-handler
sta $dc05 ; set it
lda nmi_a ; get accu
rti ; exit from NMI

Not much left to gain anymore. The routines in the other pages look very
similar, of course, just with added cycles for the jitter. The other
routines in the same page simply have other values for $D018 and $D011
(and one has to set them twice to activate the first line of a charline,
of course). These routines should give you about 40% of the cycles
back, compared to doing FLI all the time. If your effect does not use
the full width, you could put it on the right side of the screen and
initiate the FLI some cycles later. If you just use the right half, you
will be at 50%.

Ninja version

I did an implementation which is ready to use (...please, give
credits, blabla and such... well, this routine is easy to identify,
anyway). It implements the aforementioned ideas plus some more. For
example, it saves some more cycles by using the same value for $D011
and the low bytes of the NMI handlers (skipping lda #next_nmi from
above). Another neat thing is the use of the y-register inside the
NMI-routines instead of the accumulator. As we now just need load and
store instructions, this is easily possible. The benefit is that the
next opcode after the FLI is now always STY ($8C). So, the FLI-bug has
light grey as screen-ram color, and grey as color-ram color. That is a
color combination you can at least work with a little (did somebody
notice the anti-aliased logo next to the julia-routine in "REAL"?). A
little bonus is that it opens the upper/lower-border for free, so to speak.

You will find two different versions on this disk: one for the
2x2-mode without interlacing, and one for 2x2 with standard interlacing.
All these routines need one zeropage-location (default = $02). The 8 pages
containing the NMI-routines reside at $0800-$0FFF (the files themselves
are a bit shorter). Keep in mind that relocating them also means
re-adjusting the NMI-timers, because they must invoke NMIs only when the
timer at $DC06 gives the correct high byte for the JMP-instruction! The
code in these 8 pages is already terribly fragmented, which is why I
squeezed the init-routine for the 2x2-mode into the gaps. So, at least
you don't lose another 2 pages for that. To use the 2x2-modes,
simply JSR $0CDC and you are done. They don't initialize $d016 and
$dd00; that remains your job. They do set $01 to $35, however. In the
IRQ-routine for the lower-border, you can find a BIT $1003 at around
$0F00, which you can easily change either to a music-call or to a call
to your own subroutine, in case you need something done once a frame.
For more advanced changes, I strongly recommend using the source-code
(to be assembled with "AS"). Even there, I must say it is
pretty easy to spoil things. Think at least twice before making
changes other than changing the options at the beginning! For maximum
flexibility you won't get around doing your own version, anyway (not
that you wouldn't know that). Still, I hope my source serves as
educational material even without comments. Well, you have this text as
a guide and you can write me an email if you have further questions or
need a special version of it or so. You are hereby encouraged to
write to me if you have comments or ideas for further improvements!

That should be all for now. I hope this article was a little enriching
for you. For completeness, I was not the first to use the timers as part
of the opcode (and I never claimed to be). At least Kjer/Horizon used a
JMP ($DC03) in the demo "A Load of old [censored]". Still, I developed the routine
and ideas from scratch by myself and I am quite proud of it. Okay then,
happy hacking and keep the spirit!
