Thursday, November 3, 2016

Shinobi Shader System

Shaders for my Shinobi engine are written in the Lua programming language. Why Lua? First, Lua's dynamic typing and first-class functions make it easy to generate shader permutations, a hard problem to solve in a C-like language with limited preprocessor power. Second, a custom shader compiler can target different backends such as HLSL and GLSL automatically.

Below is a simple tonemapping shader from Shinobi which shows the basic syntax (which is unmodified Lua syntax). The file is actually a shader bundle: when executed, it generates four different versions of the tonemapping shader and returns them as a table of shaders.
import "common.lua"

texture_2d "source_buffer" : slot(0)

uniform_block "tonemap" : slot(3)
    : float "exposure"
    : float "saturation"

function tonemap_linear(color)
    return color
end

function tonemap_exponential(color)
    return float3(1.0, 1.0, 1.0) - exp(-color)
end

function tonemap_reinhard(color)
    local white = 0.8
    color = color * (1 + color / (white*white)) / (1.0 + color)
    return color
end

function tonemap_filmic(color)
    local A = 0.15
    local B = 0.50
    local C = 0.10
    local D = 0.20
    local E = 0.02
    local F = 0.30
    local W = 11.2
    local exposure_bias = 2.0
    local v = color * exposure_bias
    local color = ((v * (A * v + C * B) + D * E) / (v * (A * v + B) + D * F)) - E / F
    local white = ((W * (A * W + C * B) + D * E) / (W * (A * W + B) + D * F)) - E / F
    color = color / white
    return color
end

function tonemap(func)
    local color = tex_load(source_buffer, sv_screen_pos()).xyz
    color = color * exposure
    color = func(color)
    color = pow(color, 1.0/2.0)
    color = saturation(color, saturation)
    out.color = float4(color, 1.0)
end

local shaders = {}
for _,func in ipairs{"linear", "exponential", "reinhard", "filmic"} do
    shaders[func] = link_shader(compile_ps(tonemap, _G["tonemap_"..func]))
end

return shaders

Another neat feature is automatic generation of shader input and output declarations between shader stages. For example, a simple mesh shader with support for optional skinning could look something like the following. Note how the fetch_xxx() functions automatically collect attributes for shader input and output declarations and check that the shader signatures match.

import "common.lua"

function fetch_mesh_vertex()
    local position = fetch_float4("position")
    local normal = fetch_float4("normal")
    local tangent = fetch_float4("tangent")
    local texcoord = fetch_float2("texcoord")
    local v = {}
    v.position = position
    v.normal = (normal.xyz - float3(0.5, 0.5, 0.5)) * 2.0
    v.tangent = (tangent.xyz - float3(0.5, 0.5, 0.5)) * 2.0
    v.bitangent = cross(v.normal, v.tangent) * ((tangent.w - 0.5) * 2.0)
    v.texcoord = texcoord
    return v
end

function fetch_skinned_vertex()
    local v = fetch_mesh_vertex()
    v.bone_indices = fetch_int4("bone_indices")
    v.bone_weights = fetch_float4("bone_weights")
    return v
end

function vs(skinning)
    local v
    if skinning then
        v = fetch_skinned_vertex()
        v = skin_transform(v)
    else
        v = fetch_mesh_vertex()
    end
    out.sv_position = transform_vec(v.position, mvp_matrix)
    out.texcoord = v.texcoord
end

function ps()
    local texcoord = fetch_float2("texcoord")
    out.color = tex_sample(diffuse_map, diffuse_sampler, texcoord)
end

return {
    static_mesh = link_shader(compile_vs(vs, false), compile_ps(ps)),
    skinned_mesh = link_shader(compile_vs(vs, true), compile_ps(ps)),
}

The shader compiler is written in Lua. The compiler is a Lua environment with overloaded math operators, type constructors and intrinsics loaded in. The shader code is executed in this environment, which generates the output shader code in one pass. Currently the compiler has backends for HLSL and GLSL, each less than 200 lines of Lua code. The compiler itself is around 2300 lines including the function library for intrinsics. The most complex part of the compiler is the typechecker, which does full typechecking for function and operator arguments, so that if there is an error I get a Lua stack trace with line numbers pointing to the original Lua shader code instead of some confusing errors in the generated HLSL/GLSL code.
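To make the idea of "executing the shader to generate code" more concrete, here is a minimal sketch of the same technique in Python rather than Lua. It is not the Shinobi compiler: the `Expr` class, the `wrap` helper and the simplified type rule (a scalar `float` combines with any vector, otherwise the types must match) are assumptions made up for this illustration.

```python
class Expr:
    """A typed fragment of generated shader code (hypothetical helper)."""

    def __init__(self, code, type_):
        self.code = code
        self.type = type_

    def _binary(self, op, other):
        other = wrap(other)
        # Simplified type rule: equal types, or a scalar float with a vector.
        if self.type == other.type or other.type == "float":
            result = self.type
        elif self.type == "float":
            result = other.type
        else:
            # The error is raised while the shader function runs, so the
            # traceback points at the shader source line.
            raise TypeError(f"cannot apply '{op}' to {self.type} and {other.type}")
        return Expr(f"({self.code} {op} {other.code})", result)

    def __add__(self, other): return self._binary("+", other)
    def __sub__(self, other): return self._binary("-", other)
    def __mul__(self, other): return self._binary("*", other)
    def __truediv__(self, other): return self._binary("/", other)
    def __radd__(self, other): return wrap(other)._binary("+", self)
    def __rsub__(self, other): return wrap(other)._binary("-", self)
    def __rmul__(self, other): return wrap(other)._binary("*", self)
    def __rtruediv__(self, other): return wrap(other)._binary("/", self)


def wrap(value):
    """Turn a plain number into a float literal expression."""
    if isinstance(value, Expr):
        return value
    return Expr(repr(float(value)), "float")


def tonemap_reinhard(color):
    # Same arithmetic as the Lua shader above, written as ordinary Python.
    white = 0.8
    return color * (1 + color / (white * white)) / (1.0 + color)


# Running the function on a symbolic input yields target-language code.
result = tonemap_reinhard(Expr("color", "float3"))
print(result.type, result.code)
```

Because the operators only build strings, calling the shader function once produces the whole expression, and a type mismatch such as adding a `float3` to a `float2` fails immediately with a traceback into the shader source.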

Of course there are some drawbacks:

1. Because of dynamic typing you can't clearly see the types of variables by looking at the shader code. Personally this is not a big deal to me as I'm used to dynamic typing (I have written a lot of Lua code). I can always sprinkle the code with type assertions or use some form of Hungarian notation for variable names if I wanted.

2. Dynamic branching can't be expressed as Lua statements because Lua does not have a feature for overloading statement-like syntax. The solution I'm currently using is a functional (Lisp) style. For example, ifs are implemented using lambdas like this: _if(expression, true-lambda, false-lambda). Lua has sensible scoping rules for variables and full support for lambdas so this is not that bad. A nicer C-like syntax would be possible by forking the Lua codebase and adding a language extension, but I haven't had time to look at this yet. Luckily the Lua interpreter code is well structured and very lightweight so making a language extension shouldn't be that hard.

3. For similar reasons, logical and comparison operators are hardwired in Lua and can't be overloaded for code generation. Again this can be worked around with Lisp-style syntax. For example, "mag.x > mag.y && mag.x > mag.z" becomes "_and(greater(mag.x, mag.y), greater(mag.x, mag.z))". C-like syntax should be possible by hacking the Lua interpreter. The idea would be to add full metamethod support for these operators and disable short-circuit evaluation of logical operators.

Friday, May 27, 2016

Hacked kid toy


Wow, a year has passed with no updates! Let's fix that right now.

Sometime last fall my son got a kid's toy which would play this really annoying 10 second bleep bloop loop (anybody with kids knows what I'm talking about). After it had tortured the family for a few weeks I decided drastic measures were in order. So I devised a plan to replace the electronics inside the toy with an SD card player of my own design.

Basically the design consists of an SD card adapter, an ATmega328P microcontroller (I love using these chips), an R-2R resistor DAC, an op amp for buffering and low pass filtering and a TDA7052A power amplifier (see the schematic at the end for details). The design itself is quite simple, but making the PCB for this one was quite tricky because I'm making my own PCBs and they have to be single-sided. Also in this case, the space inside the toy was quite tiny and the PCB had a lot of rounded shapes.

Instead of a lengthy write up I'm going to present this project as a series of images. Here we go!

The toy opened showing the original PCB
Paper prototype of the new PCB. I wanted to be extra sure that everything fits
nicely inside the cramped enclosure.
Exposure mask printed on a transparent sheet

UV exposure process in progress. I made the UV exposure box myself a few years ago.

Etched PCB before cutting. I had to do some manual fixes using a pen before
etching because I made some errors during UV exposure process :)

Cut, drilled and painted PCB. It fits, huzzah!
I used a Dremel to cut the board. I had this idea of painting the PCB with a matte grey spray
and I really like the result so I'm probably going to use it in future projects too.

PCB with all components soldered in place. There was no room for the LED and
SD card adapter (right) so they are on separate boards sandwiched on top of the main PCB.

Schematic made in Eagle CAD

Finally here's how it looks and sounds in action. The 8 GB micro SD card is currently loaded with 50 songs with plenty of space left for more songs! Anyone recognize the song? :)


Monday, December 8, 2014

ERIC-1: Video Memory Interface

The ATmega1284P coprocessor of ERIC-1 has been able to output a PAL video signal for some time. As you may know, the coprocessor has a screen buffer of 50x32 characters and a 2KB character ROM table storing the glyphs of the character set. All this data is stored in the ATmega's internal SRAM and so far there has been no communication between the 6502 and the coprocessor. Thus the contents of the screen have been fixed.

Recently I have been working on getting the 6502 to talk with the coprocessor, so that a region in the 6502's address space is mapped to characters on the screen. In this blog post I'll review some alternative ideas that I considered before settling on the final design.

Time is of the essence


The ATmega has 16 kilobytes of internal SRAM which is really fast (it can read or write a byte in 2 cycles), but there is no way to read data fast enough from an external memory chip during PAL video generation. Therefore, the only possibility is to transfer the needed bytes from the external SRAM chip to the internal SRAM during the scanlines when the ATmega is not outputting pixels. The screen mode I'm using has 256 lines of vertical resolution and a progressive PAL frame has a total of 312 lines, so luckily I have plenty of free scanlines to do the memory transfer. I figured the best time to do it is during the top border area, which is made of 32 blank scanlines before the visible image starts.

But it's not that simple. The ATmega can't just go and peek and poke to the memory anytime because the 6502 is executing and accessing the memory all the time. There are at least three ways to solve this problem. Firstly there are dual port memory chips that can deal with memory accesses from two sources at the same time. This kind of memory is more expensive and was not used in the 80s microcomputers. I felt that this design would not be in the spirit of the 80s micros and besides I already have my regular 628128 128K x 8 SRAM chip plugged in, so I dropped this idea.

The VIC-20 and C64 solved this problem cleverly by timesharing. The internal architecture of the 6502 is not pipelined and it can only access memory when its clock signal is high. The VIC-I and VIC-II graphics chips in the VIC-20 and C64 take advantage of this and access the memory when the clock signal is low. This is really neat because both chips can think that they own the memory all the time. The disadvantage of this approach is that the video chip and the 6502 are executing in lockstep. Basically the clock frequency of the 6502 in these systems is fixed to about 1 MHz and trying to change this will mess up the video chip timing badly. I considered implementing this idea and I think it could very well work. Since the ATmega is generating the clock signal, it knows which state the 6502 clock is in. Clocking the ATmega at 16 MHz and the 6502 at 1 MHz, the ATmega would have 16 cycles for every 6502 clock cycle. This could be just enough time to access the memory. Maybe something along the lines of this pseudo assembly routine could do the trick:

I haven't tried this approach yet, because it would be pretty much impossible to verify the timing without a logic analyzer or oscilloscope. Without exactly correct timing bad things will certainly happen.

But there is a third, much simpler way and this is what I ended up doing. The ATmega is generating the clock signal and it can halt the 6502 whenever it needs to access the memory. Now that I have upgraded the CPU, halting it is ridiculously simple. I just have to set the ATmega timer frequency to zero and the clock will stop in whatever state it was. Resuming the clock is just as simple: I reset the timer frequency to whatever value it was. This solution has the nice property that the 6502 can be clocked independently from the video chip so a 4 MHz system clock or even higher is no problem at all.

Here is the piece of code I'm using to halt and restart the CPU:

Block transfers


The memory transfers are implemented in the firmware in the copymem128 routine. The routine halts the CPU, copies a 128 byte block from external SRAM to the internal SRAM of the ATmega and resumes the CPU. The routine is called on the first 13 scanlines just after the vertical sync. In total 13*128 = 1664 bytes are copied, which is a few bytes more than the size of the screen RAM. The screen RAM used to contain pointers to character data, but I have changed it to contain character indices instead. This cuts the number of bytes to be copied in half.

All the bytes copied are always on the same 256-byte page of RAM, so only the low byte of the address needs to be updated during the memory copy.

Here is the piece of code that copies the 128 bytes. I had to insert an extra nop in the loop, otherwise the data would not be copied correctly. Even without the nop, the latency should be within the specs of the 70ns SRAM, so I suspect that the breadboard must be causing problems here. I will try to optimize the nop away when I eventually build this on a PCB.

Why 128 bytes? I would have hoped to copy an entire 256 byte page per scanline but unfortunately there is not enough time per scanline to copy an entire page. The ATmega has only 1024 cycles per scanline.
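The numbers in this section can be checked with a few lines of arithmetic. The sketch below (Python, with constant names chosen for this example) compares the bytes copied per frame with the size of the screen RAM and shows the raw cycle allowance per byte for 128 and 256 byte blocks. It is only an upper bound: it ignores the cycles spent on the sync pulse and on halting and resuming the 6502.

```python
CYCLES_PER_SCANLINE = 1024        # 64 us at 16 MHz
SCREEN_COLS, SCREEN_ROWS = 50, 32
BLOCK_BYTES = 128                 # bytes copied per scanline
BLOCKS_PER_FRAME = 13             # scanlines used for copying

screen_bytes = SCREEN_COLS * SCREEN_ROWS      # one index byte per character
copied_bytes = BLOCK_BYTES * BLOCKS_PER_FRAME
slack_bytes = copied_bytes - screen_bytes


def cycles_per_byte(block_bytes):
    """Raw cycle allowance per copied byte if the whole scanline were free."""
    return CYCLES_PER_SCANLINE / block_bytes


print("screen RAM:", screen_bytes, "bytes")
print("copied per frame:", copied_bytes, "bytes, slack:", slack_bytes)
print("128-byte block:", cycles_per_byte(128), "cycles per byte")
print("256-byte block:", cycles_per_byte(256), "cycles per byte")
```

With a full page per scanline the allowance drops to 4 cycles per byte before any overhead, which leaves very little room for a loop that has to set the address, read the external byte and store it internally.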

Test program


Finally the project is in a state where the 6502 can do something visible. To test the video memory interface I assembled a small 6502 program to update bytes in the screen memory area. It first clears the screen and then prints some text on the screen in a loop. Printing has been artificially slowed down by adding a delay loop because a 6502 running at 1 MHz is such a beast ;-).

Below is a video showing the output of the test program and the 6502 source code. Writing larger programs by manually typing in opcodes is going to be really tedious, so I need to get a real assembler soon!

Thanks for reading! As always you can find the latest version of the source code at GitHub.







Sunday, December 7, 2014

ERIC-1: CPU Upgrade

I recently got a delivery of two brand new W65C02S chips from Coltek UK (£9 for two chips including shipping to Finland, not bad!). Now, if this didn't ring a bell, here's some news for you: 6502 microprocessors are still made even today. According to Western Design Center (WDC), the owners of the 6502 intellectual property, hundreds of millions of 6502s are still made each year. Applications listed on their website include scanners, toys, dashboards, industrial controllers and all sorts of other embedded devices; the list is long. Not bad for a CPU made over 30 years ago!

The processor chips I received are from these newer generations of 6502s made by WDC and they have some major improvements over the old Rockwell 6502 I had obtained earlier. First, the W65C02S version has a fully static design, meaning that it no longer loses the state of its internal registers if the clock is stopped. This makes single-stepping and halting the CPU much easier. I no longer have to wait for the clock and R/W to be high when stopping the CPU. Nice!

Also there is a new pin, the bus enable (BE) pin. When it is low, the address, data and R/W pins go to a high impedance state (meaning they are essentially disconnected). This is a really handy feature that can be put to good use straight away in ERIC-1. The W65C02S also supports clock frequencies up to 14 MHz (the max for a Rockwell 6502 is 4 MHz). The breadboarded ERIC-1 probably can't sustain clock frequencies that high due to the stray capacitance and long wires of the breadboard, but it's good to have that option when I eventually build this on a PCB. WDC has also implemented a few new opcodes but I haven't taken a closer look at them yet.

The W65C02S is almost a direct replacement for the R65C02 but there are a few important details. The RDY pin is now bidirectional when it used to be only an input pin. There's a new instruction WAI that puts the RDY pin into output mode. Therefore it's important that this pin is not pulled up by connecting it directly to VCC or you could risk causing a short if the pin goes to output state. Instead a pull up resistor needs to be used. Well, I was already doing that so no problem. Another gotcha is the new function of pin 1, which used to be GND on Rockwell but it's now an output pin. According to the datasheet pin 1 is now labeled Vector Pull (VPB) which indicates that a vector location is being addressed during an interrupt sequence. I don't know what it is used for but better leave that unconnected.

With the new BE pin I was hoping to get rid of the 74HC541 buffers that I was using to detach the 6502 from the address bus when the coprocessor needs to access memory. I replaced the old Rockwell with a W65C02S and replaced the buffer chips with jumper wires. I also needed to invert the sense of the BE signal in the ATmega firmware: the 74HC541 has an OE input which is active low, whereas BE is active high on the W65C02S. I made the changes and everything seemed to work correctly.

After some time, however, I noticed a problem. The ATmega refused to be reprogrammed. I'm using a USBTiny programmer to update the ATmega firmware and it is connected to the SPI pins of the ATmega. The same pins are also mapped to I/O port B, which is connected to the address bus of the 6502, so I suspected that there must be bus contention going on: the programmer is attempting to reprogram the chip while the 6502 is still driving the same lines for some reason. I disconnected the address lines on the SPI pins and sure enough the problem went away. This was really strange because the same setup used to work with the 74HC541 buffers. The W65C02S bus drivers must be somehow different from the 74HC541 buffers or I must have made an error somewhere. It could be some sort of timing issue. According to the datasheets the propagation delay for a '541 is typically 10ns and the W65C02S BE has a max delay of 30ns. Is this enough to make a difference? I doubt it. Anyway, I haven't been able to solve this mystery yet.

Even with the internal bus drivers of the WDC chip, one 74HC541 must remain for buffering the CE signal for the SRAM chip (when the ATmega accesses memory it needs to take over the SRAM CE signal and the simplest way to do this is to detach the CE from 6502 using a '541). As a workaround for the reprogramming issue, I routed three address lines through the same 74HC541 that is used to buffer the CE signal.

With these changes the WDC 6502 can now coexist happily with the ATmega1284P. With two chips gone the design is now simpler but I'm still not entirely happy with the results. The strange issue with the firmware updates is still an unsolved mystery and routing the three address lines through the buffer feels like a kludge. The kind and wise folks of the 6502.org forum have given me some ideas to try to solve this mystery. I've also ordered a Saleae logic analyzer which should come in handy in debugging these kinds of problems. I'll probably revisit this issue later armed with proper tools.

The new upgraded ERIC-1 with a W65C02S. Two 74HC541 chips from
the earlier design have been removed.

Updated schematic. The remaining 74HC541 has a dual duty: it takes care
of buffering the SRAM CE signal and also disconnects the three address
 lines A12-A14 when ATmega's firmware is updated.

Wednesday, December 3, 2014

ERIC-1: Bitbanging the video signal

I've been working on video signal generation for my ERIC-1 microcomputer lately. As you may know I built an 8-bit console in the past that generated a composite video signal using an ATmega328P microcontroller. The microcontroller output an 8-bit color value every 5th cycle, which resulted in a pretty low resolution image. A DAC resistor network and an AD725 chip were used for RGB to PAL color conversion. For ERIC-1 I'm taking a slightly different route, mainly because I want to get at least 40 characters per line on the screen and this requires a higher resolution than was possible in the console project.

Life and deeds of PAL video signal


A progressive PAL video signal is actually quite simple. A single PAL frame has 312 lines and the lines have the following structure. The first 5 lines indicate the start of a new frame and they provide the necessary vertical sync signals for the monitor to sync to. After that the next 304 lines contain the visible image, although some lines, typically the first 20 lines at the top and last 20 lines at the bottom, are clipped off by the monitor. The exact number of clipped lines depends on the monitor or TV. Finally after the visible image come 3 lines that again contain vertical sync signals and tell the monitor to jump back to the top of the display.

Each PAL scanline is exactly 64us long. The sync lines are made of a series of long and short pulses. A long pulse is 30us low followed by 2us high state. A short pulse is 2us low followed by 30us high state. These pulses are used to generate the sync signals as follows:

Line     First half     Second half
1        Long pulse     Long pulse
2        Long pulse     Long pulse
3        Long pulse     Short pulse
4        Short pulse    Short pulse
5        Short pulse    Short pulse
6-309    Visible lines
310      Short pulse    Short pulse
311      Short pulse    Short pulse
312      Short pulse    Short pulse
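The sync pattern above is easy to express as data. The following Python sketch (names chosen for this example, not taken from the firmware) returns the low/high segments for each sync line, which makes it easy to confirm that every sync line adds up to exactly 64us.

```python
# Each pulse is a list of (level, duration in microseconds) segments.
LONG = [(0, 30), (1, 2)]     # 30 us low followed by 2 us high
SHORT = [(0, 2), (1, 30)]    # 2 us low followed by 30 us high


def sync_line(line):
    """Return the segments of a sync line, or None for a visible line."""
    if line in (1, 2):
        return LONG + LONG
    if line == 3:
        return LONG + SHORT
    if line in (4, 5) or 310 <= line <= 312:
        return SHORT + SHORT
    if 6 <= line <= 309:
        return None
    raise ValueError(f"a progressive PAL frame has lines 1-312, got {line}")


for line in range(1, 313):
    segments = sync_line(line)
    if segments is not None:
        total = sum(duration for _, duration in segments)
        print(line, segments, "total:", total, "us")
```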

Every visible line starts with a horizontal sync pulse for the monitor. The HSYNC is 0V for 4.7us. The HSYNC is followed by a "back porch", which is 0.3V for 1.65us. In case of a color signal, a special color burst signal is generated during the back porch, but since we are at the moment dealing only with black and white images, we can skip this detail. After the back porch the remainder of the scanline contains luminosity data in range 0.3V (black) to 1V (white).

Since I'm using an ATmega1284P microcontroller which can only output digital values that are either 0V (low) or 5V (high), how can I generate the needed voltages? For a black and white image, the needed voltages are 0V (HSYNC), 0.3V (black) and 1V (white). The crucial point to understand is that there is essentially a 75 ohm resistor inside the monitor which terminates the composite video signal to ground. This is called the input impedance and the value of 75 ohms is determined by the PAL standard. With this information it's simple to come up with the following circuit:

SYNC, VIDEO and GND coming from left, monitor on the right.

The 1K resistor and the 75 ohm "resistor" inside the monitor form a voltage divider. When the SYNC signal is high, the monitor receives the following voltage: 75 / (1000 + 75) * 5V = 0.35V. Similarly the 470 ohm and 75 ohm resistor form another voltage divider that sets the voltage level at the monitor input to 75 / (470 + 75) * 5V = 0.7V when the VIDEO signal is high. With different combinations of SYNC and VIDEO values we can generate the voltages 0V, 0.35V and 1.05V. Close enough to what we need!
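The divider figures above treat each resistor on its own. As a cross-check, the Python sketch below solves for the voltage at the monitor input with both resistors connected at once. It assumes that an output pin which is low is actively driven to 0V, so its resistor becomes an extra path to ground; this is an assumption made for the estimate, and the function names are made up for this example.

```python
def simple_divider(r_series, vcc=5.0, r_load=75.0):
    """Voltage from one resistor and the 75 ohm load, ignoring the other pin."""
    return r_load / (r_series + r_load) * vcc


def monitor_voltage(sync_high, video_high, vcc=5.0,
                    r_sync=1000.0, r_video=470.0, r_load=75.0):
    """Node voltage with both resistors connected (nodal analysis).

    Assumes each pin is driven to either 0 V or vcc, never left floating.
    """
    g_sync, g_video, g_load = 1 / r_sync, 1 / r_video, 1 / r_load
    v_sync = vcc if sync_high else 0.0
    v_video = vcc if video_high else 0.0
    return (v_sync * g_sync + v_video * g_video) / (g_sync + g_video + g_load)


print("simple, SYNC only:  %.2f V" % simple_divider(1000.0))
print("simple, VIDEO only: %.2f V" % simple_divider(470.0))
print("sync level:  %.2f V" % monitor_voltage(False, False))
print("black level: %.2f V" % monitor_voltage(True, False))
print("white level: %.2f V" % monitor_voltage(True, True))
```

Under that assumption the levels come out slightly lower than the simple estimate, roughly 0.30V for black and 0.95V for white, which is still close to the 0.3V and 1V targets.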

The lost art of cycle counting


So, to generate a PAL frame we need to change the values of the two output pins SYNC and VIDEO very fast. These signals will get converted to proper voltage values by the two resistors. But how fast exactly do we need to change the pins, or "bitbang" them? Well, quite fast for a microcontroller running at 16 MHz... A single scanline is 64us long and an MCU running at 16 MHz has 16 clock cycles per microsecond. Therefore during a PAL scanline we have 64*16 = 1024 cycles. In 1024 cycles we have to generate the HSYNC pulse, the back porch pulse and the visible pixels. That means there's only time for a couple of clock cycles per pixel!

In the console project, I used a timer interrupt to trigger a routine every 64 microseconds. But interrupts have a rather large overhead on the time scale we are working with here: registers have to be restored and jumping to and back from the interrupt routine takes time. This time I decided to do this more efficiently. I have written the video signal generation entirely in assembly and explicitly cycle counted the code so that each scanline takes exactly 1024 cycles to execute. After a scanline has been processed I can immediately begin generating the next scanline. A very nice thing with this approach is that I can keep important values such as line counters and memory pointers in registers all the time.

Every scanline begins with the HSYNC signal, which is 4.7us in length. At 16 MHz that is 75.2 cycles, so we round to 75 cycles. Then the back porch is 1.65us, which rounds to 26 cycles. In assembly we can cycle count and output the HSYNC and back porch in 75+26 = 101 cycles. Then we have exactly 1024-75-26 = 923 cycles left for the pixels. Let's round this to 900 cycles because we need some cycles for housekeeping stuff like incrementing the current line counter and jumping to the routine processing the next scanline. For e.g. a 320 pixel horizontal resolution that would be only 900/320 = 2.8 cycles per pixel. Pulling a pixel from the MCU's internal SRAM takes 2 cycles and outputting a pixel takes 1 cycle, so we would need at least three cycles even when doing simple direct bitmapped graphics. Initially it seems there is no way to get what we want with this microcontroller.
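The scanline budget can be reproduced with a short calculation. This Python sketch uses only the timings given in the text; the constant and function names are chosen for the example.

```python
F_CPU_MHZ = 16   # cycles per microsecond


def us_to_cycles(microseconds):
    """Convert a duration to the nearest whole number of CPU cycles."""
    return round(microseconds * F_CPU_MHZ)


LINE_CYCLES = us_to_cycles(64)            # full scanline
HSYNC_CYCLES = us_to_cycles(4.7)          # 75.2 rounds to 75
BACK_PORCH_CYCLES = us_to_cycles(1.65)    # 26.4 rounds to 26
PIXEL_CYCLES = LINE_CYCLES - HSYNC_CYCLES - BACK_PORCH_CYCLES


def cycles_per_pixel(usable_cycles, horizontal_pixels):
    return usable_cycles / horizontal_pixels


print("cycles per scanline:", LINE_CYCLES)
print("left for pixels:", PIXEL_CYCLES)
print("cycles per pixel at 320 px: %.1f" % cycles_per_pixel(900, 320))
```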

To make matters worse, a bitmapped image takes a lot of memory to store and is very heavy for the 6502 to process. That's why 6502 computers usually have a character based display mode, where the screen RAM contains indices or pointers to character data stored elsewhere in memory. For example, the screen of a C64 is divided into 40x25 characters and each character is 8x8 pixels. So for every 8th pixel the video generator has to fetch the character from screen RAM and then pixels from character memory. All this increases the cycle cost way higher than 3 cycles per pixel.

Attempt that almost worked


Luckily there is a faster way to get bits out of the ATmega1284P. The ATmega1284P has a built-in Serial Peripheral Interface (SPI) which is essentially a shift register whose clock frequency can be configured. The maximum rate for SPI is the system clock divided by two, that is 8 MHz in our case. After the SPI has been initialized, a byte can be output by writing it to the SPI data register. The SPI hardware then shifts out the bits at 8 MHz, i.e. at 2 cycles per pixel. What's great is that the SPI runs independently so we can execute other instructions while the SPI is doing the transfer. OK, I wired this up and wrote a scanline routine that pulls a character from memory, fetches a byte encoding the 8 pixels of a character line and outputs the byte using SPI.

Initial results were very promising. I could get 320x256 resolution and even higher seemed possible. However, then I hit a major snag! See image below.

Argh, those black vertical gaps between characters!

There is a one pixel gap between every character. Even when I waited for exactly the right number of cycles, I got either this gap or corruption on the screen. I was pretty sure I was doing everything right and it felt like a hardware problem. Googling revealed a nightmare: this is a known hardware limitation. The SPI cannot send a continuous stream of bytes, apparently because there is no buffering. There is just a single register that gets shifted out and the hardware needs one extra cycle to load the shift register between transmits.

This was such a major setback. It seemed I would have to live with the gaps. This didn't seem like a good idea because I want to get nice character based graphics out of this thing eventually and having gaps there would certainly ruin it in a major way.

USART MSPI to the rescue!


I thought about using an external shift register as a workaround. Bytes would be loaded one at a time using 8 parallel I/O pins (plus some control pins for the clock signal etc.), but I was already very tight on I/O pins so I couldn't afford this. I was really frustrated and even considered abandoning the idea of bitbanging the video signal using an MCU. But then after reading the datasheets carefully I learned there was another way: the built-in USART can send data in SPI fashion, in what is called "USART in MSPI mode". The USART has a transmit buffer, so maybe this hardware could be the magic I needed to fix the gaps? A quick Googling seemed to indicate that this could be possible. So last night I made the necessary changes and nervously fired up my microcomputer... and huzzah, the gaps were gone!

With this victory under my belt, I optimized the code further. I could now output an 8 pixel wide character in just 16 cycles, including the screen RAM to character data indirection. With this I could extend lines to 50 characters, yielding a resolution of 400x256. The character generation now needs 50*16 = 800 cycles so there is still some time left. I could still extend the screen width a bit, but I'm going to settle for this nice round number for now.
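For reference, here is the character mode budget as a small Python calculation. It reuses the 923 pixel cycles derived earlier in this post; the upper bound on the line width is only a rough figure, since it leaves no cycles for housekeeping between scanlines.

```python
CYCLES_PER_CHAR = 16        # 8 pixels shifted out at 2 cycles per pixel
PIXELS_PER_CHAR = 8
CHARS_PER_LINE = 50
PIXEL_CYCLES = 1024 - 75 - 26   # scanline minus HSYNC and back porch

USED = CHARS_PER_LINE * CYCLES_PER_CHAR
WIDTH = CHARS_PER_LINE * PIXELS_PER_CHAR
SPARE = PIXEL_CYCLES - USED
MAX_CHARS = PIXEL_CYCLES // CYCLES_PER_CHAR   # ignores housekeeping cycles

print("cycles used for characters:", USED)
print("horizontal resolution:", WIDTH)
print("spare cycles per scanline:", SPARE)
print("upper bound on characters per line:", MAX_CHARS)
```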

You can find the source code of the project at GitHub. The screen contents are so far stored in the ATmega's internal SRAM and completely static. Next I'm going to interface it with the 6502 and then the real fun can begin!

Finally, here's a gapless screenshot using a very familiar character set.



Saturday, November 29, 2014

ERIC-1: Homebrew Computer

Everyone seems to be building their 6502-based 80s esque computers nowadays, and it seems to be a lot of fun. Well, I don't want to miss the party, so I've recently started building one of my own. I've now been working on the thing for a few nights and here's what I've got so far....


Behold the mighty ERIC-1 running at a whopping 2 MHz!


As you can see I'm building this on a breadboard, but the plan is to build the final version on a PCB that I'll be etching myself. But before I can do that I need to settle on a few features.

The general design is quite simple. There is the 6502 CPU, 64 KB of SRAM (actually I'm using a 128 KB SRAM chip but the 6502 can only address 64 KB) and a coprocessor. The coprocessor is an ATmega1284P microcontroller that has several purposes: first it generates the necessary reset and clock signals for the 6502. It also contains the ROM image and implements the I/O interface for the system. Later I'm planning to use it to generate composite video and maybe sound.

Reset and Clock Signals


The 6502 seems to be very picky about the quality of the reset and clock signals. I found out the hard way that simply connecting a button to the reset line, as you can with e.g. AVR microcontrollers, is not enough. The reset signal must be clean, rise quickly and be properly debounced. Also the clock signal can only be stopped in the high state, although modern versions of the 6502 do not have this restriction. I wanted to be able to switch between different clock frequencies easily and switch between free run and single step modes on the fly. The ATmega1284P can handle these tasks and more easily.

Shared Memory


The only way to implement I/O with a 6502 is to use memory mapping. This means that the I/O devices usually sit on the address and data bus of the 6502 and the 6502 simply reads and writes certain addresses corresponding to the I/O devices. Usually this is implemented with some sort of address decoding logic which generates chip select signals for devices based on the status of the address lines.

I wanted to try a different approach. In ERIC-1 the 6502 and the coprocessor share the same 64 kilobytes of memory. Naturally both devices cannot access the memory at the same time, so the coprocessor acts as the bus master. Since the coprocessor also generates the clock signal for the 6502, it can simply halt the 6502 when it needs to access the memory. When the 6502 is halted it still drives the address bus so I'm using 74HC541 buffers to detach the CPU from the bus. The buffers are controlled by a single output pin of the coprocessor.

But what about the data bus? In case the CPU is halted during a write cycle, the 6502 is driving the data lines. Trying to update the SRAM at the same time will cause bus contention when both the coprocessor and the 6502 are trying to write data to the bus. A simple fix for this is to halt the CPU only during read cycles, i.e. when the 6502's R/W is high. This means the coprocessor may have to wait a few clock cycles longer when it wants to access the bus, but overall this seems like a good solution.

Here is the code fragment from the AVR firmware that I'm using to halt the CPU:


ROM


Any computer needs some non-volatile memory (memory that keeps its state even when powered off) so that it knows what to do when it boots up. Usually in 6502 systems there is a separate ROM chip that contains the boot up routines of the computer. The ROM chip is mapped to the upper part of the 64 KB address space, because when the 6502 wakes up it first reads the starting address from $FFFC - $FFFD (the 6502 is little endian, so the lo byte of the address is stored first, then the hi byte). This reset vector points to the machine code routine (in ROM) that should be executed first.
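The little endian reset vector lookup can be illustrated with a tiny Python model of the 64 KB address space. The ROM start address $F000 used here is a hypothetical value picked for the example, not the address used by ERIC-1.

```python
RESET_VECTOR = 0xFFFC


def read_reset_vector(memory):
    """Return the 16-bit start address stored at $FFFC (lo) and $FFFD (hi)."""
    lo = memory[RESET_VECTOR]
    hi = memory[RESET_VECTOR + 1]
    return lo | (hi << 8)


memory = bytearray(0x10000)          # 64 KB address space, all zeros

ROM_START = 0xF000                   # hypothetical start of the boot routine
memory[RESET_VECTOR] = ROM_START & 0xFF      # lo byte first
memory[RESET_VECTOR + 1] = ROM_START >> 8    # then hi byte

print("6502 starts executing at $%04X" % read_reset_vector(memory))
```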

Since the coprocessor of ERIC-1 has full access to the SRAM, I decided to use the upper part of SRAM as ROM. The ROM code is stored in microcontroller's flash memory. At startup the coprocessor holds the 6502 in reset while the ROM image is copied to the SRAM chip.

What can it do?


To test the computer that I've built so far, I made a simple ROM routine that… you guessed it, blinks an LED! Here is the ROM routine, hand assembled in the coprocessor firmware:

The coprocessor firmware holds the 6502 in reset, copies this piece of code to SRAM and starts generating the clock signal. Every thousand clock cycles or so it halts the 6502, reads the byte at address $10 and turns the LED on or off based on the value read: if the value is below 128 the LED is on, otherwise it's off. The ROM routine therefore has a delay loop (the X register counts from 255 to 0), otherwise the blinking would be way too fast to notice with the naked eye.

Now that I have the basic setup working, I'm going to add more I/O. I'll probably start working on the video generation next. Until that time!

Schematic



p.s. ERIC-1?? Some of you may remember an obscure 6502-based computer from the 80s, the Oric. It was not a hugely popular computer, and in fact I've never even seen one, but I thought it would be cool to give my computer a nickname reminiscent of a real 6502 system. Or it could just be an acronym for "Extraordinarily Robust Integrated Computer" :)

Monday, November 24, 2014

Driving a RGB LED with Arduino Uno


Here's a simple way to drive an RGB LED with an Arduino Uno. The microcontroller generates a multiplexed PWM signal in software. The PWM is generated in an interrupt routine so the main program and the LED update frequency are completely independent. The only requirement is that the main program does not use hardware timer 1.

Components required: one common cathode RGB LED, one 330 ohm resistor.