Wednesday, June 19, 2013

Direct Digital Synthesis (DDS) on the GA144 (Part 2)

I found out why I couldn't get two nodes to run with my start script. I was too impatient. I'd forgot that I had the serial connection to the Eval board running at only 115kbps. The command to install the code in the various nodes visits all the nodes and it was simply taking forever and I assumed the IDE had stalled.

Anyway, I took the delay word out of node 617 and put a simple counting loop in node 616 which asserts the 'right' port (which is common to the nodes) when the loop finishes. Then node 617 waits till its 'right' port is asserted, then continues with code to load the sine value into the DAC port.

This makes the sine wave much more stable. With the value of 500 in the go loop the sine wave is precisely 799.9Hz and stays rock solid on that value. I doubt if a crystal would make it much more stable but I'll try a crystal next.

With the following code, make sure 866 and 868 are loaded in block 200 to ensure they are compiled. Then '870 load' will start the app.

866 list 
sine wave generator,
617 node 0 org,
0 , 40 , 80 , 120 , 170 ,,
220 , 280 , 370 , 511 ,,
hart 3300; -.00433 .07943
            -.64589 1.57079,
cos tri 2* 2* . triangle
    dup *. 2 poly
  
 -281 , 5203 , -42329 , 37407 ,
  
 push drop pop *. + ;
scaled
 2/ 8000 . +
       8191 12 interp ;
dac! io b! 155 or !b ;
start
 1f 128 phaseinc
  
 dup dup or
  
 begin
  
 dup cos scaled
  
 synch right b! @b drop
  
 dac!
  
 over . +
  
 end


868 list 
timer count cycles,

616 node 0 org,
hold for . . unext ;,
ms for 27063
     for . . . unext
  next;

start 08 right b! 1 ms,
go 500 hold !b go ;

870 list 
sine wave loader

host load loader load
using default ide paths
kill boots
0 708 hook 0 -hook
setup application
616 +node 616 /ram 8 /p
617 +node 617 /ram 1f /p
visit sine path

panel pause 2 ship

Tuesday, June 18, 2013

Direct Digital Synthesis (DDS) on the GA144

I wanted a simple demo of the GA144 producing sound. I found a sample DDS app for a SeaForth chip (predecessor to the GA144) but the code syntax is sufficiently different that I couldn't simply copy it.

The concept is simple enough. Calculate the values of a sine wave over a fixed time interval and load those values into a digital-to-analog converter (DAC). Connect the DAC to a loudspeaker and listen to the pretty tone.

Rather than look up tables of (co-)sine values, which can take up lots of memory, this app chooses to approximate the values at each interval using relatively fast calculations. The code takes much less room and easily fits into the limitations of the F18 nodes in the GA144.

But it took me a couple of days to understand the code. It all resolved to two words: cos and scaled. cos in turn uses three other words: triangle, poly and *. while scaled uses interp. Now I knew that each of those words is part of the GA144 ROM library. There was also an intriguing comment in cos: 'Hart, 3300'.

So I googled for 'Hart cosine' and one of the first hits was a page by Chuck Moore about sines and cosines in colorForth. This page also explained the 'Hart, 3300' comment, viz., Computer Approximations by John F Hart, Wiley 1968; function 3300. I googled for 'Computer Approximations Hart' and found that the only copy I could buy was hardcover from Amazon. However I also found a link to an electronic copy which allowed me to read the theory behind the magic numbers that turn up in the code.

And then I (quite by accident) discovered the shadow page for the poly word on the GA144 which contains the exact same code as the cos method but translated into colorForth.

cos f-f'
hart 3300
-0.0043 0.0794 -0.6459 0.5708
  
 2* 2* . triangle dup *. 2 poly
  
 -281 , 5203 , -42329 , 37407 ,
  
 push drop pop *. + ;                       

So now I needed to find out how to use the interp word to scale the cosine values into values between 0 and 511. Lots of test code and simulations later...

The final code was this:
866 list 
sine wave generator,
617 node 0 org,
0 , 40 , 80 , 120 , 170 ,,
220 , 280 , 370 , 511 ,,
hart 3300; -.00433 .07943
          -.64589 1.57079,
cos tri 2* 2* . triangle
   dup *. 2 poly
  
 -281 , 5203 , -42329 , 37407 ,
  
 push drop pop *. + ;
scaled
 2/ 8000 . + 8191 12 
       interp ;
dac! io b! 155 or !b ;
delay
 700 1 for . . unext ;
start
 22 128 phaseinc
  
 dup dup or
  
 begin
  
 dup cos scaled
  
 delay dac!
  
 over . +
  
 end
A couple of explanations. The delay word in the original was another node running a single timing loop and raising a signal on a comms port at the end. This in turn allows the main node to synchronise with a (relatively) accurate clock signal. However I ran into problems starting multiple nodes on the chip so I brought the synch delay into the node, preferring to sacrifice some accuracy for simplicity. The constant, 700, controls the amount of delay between DAC updates and, hence, the frequency. 500 gives around 550Hz, 700 gives around 447Hz.

interp expects an interpolation table at address 0. In this case it's not quite a linear interpolation. (Refer to DB001 F18A Technology Reference for details.) scaled halves the cos value, adds 0.5 (to scale into positive numbers only) and calls interp with s and m calculated for L=16 bits and n=3 (a 9-entry table i.e. 2**3+1) to interpolate the values between 0 and 511. dac! sets the DAC value on the node.

The output of the DAC needs to connect to a resistive load. The EVB001 kit includes some 47ohm resistors so I soldered one between the DAC pin and earth and then I was able to see a (relatively) clean sine wave output on my 'scope. It also sounded like a sine wave when I connected a speaker.

Make sure block 200 contains a load of block 866 so 'compile' will include it. The block to start the app on the GA144 (copied from 9.4 of the Users Guide) is:

872 list 
demo ide boot empty compile serial load
customize -canon 0 fh orgn !
a-com sport ! a-bps bps ! !nam
run talk 0 617 hook 0 64 617 boot
  
 upd ?ram panel 22 call ;

To load and run type '872 load' and 'run' in the IDE.

My next task is to get a second node working as a timer to increase the accuracy of the sine wave (possibly using a crystal?). Then I want to port another VentureForth app, Musicbox, to the GA144.

Tuesday, June 11, 2013

Combsort for Softsim

Chuck Moore has put a bubblesort script on one of his colorForth pages:

bubble nt a! for @+ @ - + - -if
@ a -1 + a! @ push !+ pop ! then drop next ;


Simple and compact. I was interested to see if such a simple sort could be speeded up without using too many more instructions. I found the Wikipedia entry for Combsort which calculates a decreasing gap between the two comparison values with the aim of getting the smallest values moved into place as quickly as possible.

So I started transliterating the Wiki algorithm into colorForth. The gap is shrunk by a factor of 1.3 which is apparently a result of some extensive testing with varying shrink factors. Initially I wrote a subroutine to multiply by 10 then divide by 13 (plus lots of shuffling registers around):
shrinkgap a push dup dup or over
   10 * -13 --u/mod pop a! push
   drop drop drop pop ;

And then it hit me that 1.3 is only an approximation. There's nothing absolute about it. It occurred to me that if would be much more efficient to use factors of 2 where possible. So my shrink calculator is simply:
2/ dup 2/ +
In other words take (1/2+1/4) (i.e. 0.75) of the gap. 1/1.3 = 0.77 which is amazingly close to 0.75.

In the end my Combsort still took lots of code and only saved a few hundred cycles compared to Chuck's bubblesort.

comb sort,
504 node 0 org,
4 , 19 , 7 , 1 , 12 , 13 , 0 ,,
11 , 3 , 9 ,,17 , 16 , 13 , 6 ,,
2 , 1ffff , 5 ,,
shrink 11 2/ dup 2/ + if ;
   then - 2* - ;
comb
 tn push dup dup,
...
begin,
......
16 pop dup push a!,
......
dup or swpd push,
......
shrink,
......
over over - . +,
......
for dup a + b!,
......
@ - 1 + @b + -if,
......
gt 20 @ @b ! !b pop pop,
......
dup or - push push,
......
le then drop @+ drop next,
...
26 dup dup dup or - +,
......
if dup or - then gap1?,
......
pop swpd?,
......
over - and or -,
...
until pop ;,
2e 17 0 comb 32 r---


Friday, June 7, 2013

Quicksort and GreenArrays' arrayForth simulator from a beginner's perspective.

I purchased an EVB001 eval board for the GA144 chip from GreenArrays a year ago. I wasn't sure what I wanted to do with it at the time but eventually I realised the GA144 might be more appropriate for low power digital signal processing apps than the FPGA apps I'd been developing.

A couple of months ago I started trying to program it. Conceptually the GA144 seems simple: 144 asynchronous cores, CPU instruction set is a variant of Forth called colorForth, multiple ADCs, DACs and I/O ports. What could go wrong? :)

Well, for starters I hadn't written in Forth before. One of my first jobs (about 35 years ago) was writing assembly language code for 8080 and Z80 chips and I remembered many of the problems with low-level programming. Memory management was paramount. Speed of operation was next most important.

I later became a C programmer. Once again 90% of my time was spent managing memory. When I discovered a language called Perl in 1989 which had automatic memory management I was in Heaven.

A recent blog reminded me of the sheer elegance Perl is capable of. Here is a minimal Quicksort script that had me simply gazing in admiration:

sub quicksort {
    @_ and return (
        quicksort (grep { $_ < $_[0] } @_),
        grep ({ $_ == $_[0] } @_),
        quicksort (grep { $_ > $_[0] } @_)
    );
    ();
}

$, = ','; say quicksort(10,2,5,3,1,9,7,4,2,3,4,8,6);


The algorithm takes full advantage of Perl's automated memory management. Copies of lists created by the greps and the recursive calls to quicksort probably  eat up huge chunks of RAM. But the "simplicity" and "obviousness" of the algorithm are outstanding.

So could I implement Quicksort in colorForth in the arrayForth IDE simulator?

A quick scan of the online literature brought up this. The problem for me is that it is written in "standard" Forth. I hadn't looked at Forth until about a year ago and to complicate matters colorForth and Forth have almost nothing in common syntactically. Even when the instructions are the same (e.g. 'or') they do different things ('or' is "inclusive OR" in Forth, "exclusive OR" in colorForth). So unless I was prepared to thoroughly learn Forth in addition to colorForth there was no way I could use Will Baden's algorithm.

So I decided to implement the algorithm from scratch using the GreenArrays manuals and user guides.

I quickly discovered I was back in memory management hell.

Programming the GA144 is mostly about memory management. Each core (GreenArrays call it a 'node') has only 64 18-bit words of RAM plus 64 18-bit words of ROM library. The 32 colorForth instructions are sufficient to implement a viable version of Forth. There is a 10-word data stack and a 9-word return stack, a  general purpose register, A, and a write-only B register used mostly to talk to the I/O ports but which can address all of RAM and ROM.

There are 144 nodes in the GA144 chip running asynchronously but able to synchronise with the other nodes via the I/O port communication mechanism.

So the first decision I had to make was whether to use one node or multiple? The Perl example used only 13 elements in the list and since Quicksort uses O(log N) stack elements it seemed possible to fit the code and the data in the same node. (The example below runs out of stack space with 18 or more elements in the data array. It also runs out of instruction space with more than 19 elements.)

The second decision was to follow the wikipedia example using in-place element exchanges rather than creating multiple sub-lists for later merging as used in the Perl example.

The following code is mostly a line-for-line transliteration of the wikipedia example. I did steal one idea from Will Baden's example. I used his midpoint pivot selection code to bypass Quicksort's weakness when sorting already sorted lists.

The arrayForth IDE is a PITA. It's a steep learning curve for a total beginner like me. And it can't import code from outside sources. Every character has to be hand-typed. So "literate" documentation is virtually impossible and largely a waste of time. But eventually I found that the ability to quickly change an instruction and see its effect in the Softsim simulator provided me with all I needed to learn the colorForth instruction set. Admittedly the IDE won't teach me the idioms Forth programmers have used for decades. Nor will it teach me the idioms colorForth programmers have discovered. But, again, the IDE contains the full source code for colorForth, arrayForth, polyForth and a heap of tests displaying how one uses colorForth in "real life".

I also found useful idioms in Chuck Moore's notes about colorForth's instructions.

And GreenArrays' email support has been superb. Every question of mine, no matter how dumb, has been answered quickly and comprehensively.

The program code below is in coloured text (red for method definitions, green for instructions and yellow for data) and I've interspersed my comments in black text.

The code and data fill 56 of the available 64 words of RAM. Not much room left for experimentation. The starting line (at the end of the listing) loads the data stack with the beginning and ending address of the data array and calls qsort which calls part to divide the data array into partitions of smaller and larger elements around a pivot value and then recursively calls itself on each of these subarrays.

part partitions the array into three sections. The "left" section contains all array elements less than or equal to the pivot value. The "pivot" section contains the pivot value. And the "right" section contains all array elements greater than the pivot value. At completion the data stack contains (left, right, pivot) addresses. (Pivot address is in T).

But most of my time was spent trying to keep the use of the data stack to a minimum because of the recursive calls to qsort. So often I had to store data in A or R rather than leaving it on the data stack for fear that a recursive call would push previous start and end addresses off the end of the 10-element stack. In fact this does happen when there's more than 18 elements in the data array.

503 node 0 org
I picked node (CPU) 503 randomly. Nothing special about it.
19 , 4 , 7 , 1 , 12 , 13 , 0 , 11 , 3 , 9 ,
17 , 16 , 10 , 6 , 2 , 13 , 5 ,
I put the data array at the start because I always know its start address (0).

setab over over b! a! ;
Duplicate S and T then move T into B and S into A.

exch setab @ @b ! !b ;
Array addresses are in S and T. Call setab to load A and B with array element addresses, then swap array elements pointed to by A and B.

1+ 1 . + ; 1- -1 . + ;
Increment/decrement T.

part exch drop
Exchange the pivot element (address in T) with the end element (address in S). The pivot address in T can now be dropped from the stack as it is no longer needed.
   over over - . + 1+ - push
S and T contain start and end addresses of array to sort. Calculate length of array less 1 for loop counter. Push loop cntr into R register.
   setab over
setab puts array start address (S) into A and end address (T) into B. The end array element contains the pivot value. And at start of loop S is also the address of where values less than or equal to the pivot value will be swapped into (referred to as sx, switchindex, hereafter) so put a copy of sx into T (over).
   begin
      push a
Save sx onto return stack (nowhere else to safely store it) and load A into data stack
      @ @b - . +
Load contents pointed to by A onto data stack, load contents pointed to by B into stack, subtract.
      -if drop pop exch 1+ dup push
If T is negative *A (array element pointed to by A) is less than or equal to *SX (pivot value, pointed to by SX) so pop SX into T, exchange *A and *SX, increment SX and push it back into R. The dup creates a dummy value for the next line to drop
      then drop
Drops either result of subtract or dummy value.
      1+ a! dup b! pop
Drop leaves A in T. Increment T and store it in A so A points to next element in array. T now holds the value of B. Make a copy and pop it into B because the element exchange destroys the previous contents of B. Then pop SX into T.
   next exch ;
When loop finishes T holds pointer to pivot value and S holds SX pointer. So can swap pivot value into its correct place prior to return. After exchange stack holds (left, right, sx).

qsort over - over . +
Duplicate the start and end addresses and subtract the start address from the end address.
   -if drop ;
If the array addresses are equal there is nothing more to sort so return with stack effectively unchanged by dropping the result of the subtraction.
   then 2/ push over pop . +
If the array addresses are unequal there is something to sort. After the subtraction T contains the length of the array so halve it with a left shift (2/) and add it to the start address giving the address of the midpoint of the array.
   part
part is called with the data stack containing (left, right, mid) addresses.
   a! push a 1- a 1+ pop
part returns (left, right, sx) so pop SX into A, push right index into R, retrieve A, decrement it, retrieve another copy of A, increment it and then pop R into T. This leaves (left, pivot-1, pivot+1, right) in stack.
   qsort drop drop qsort ;
First call to qsort will sort right partition. Returns when right partition is (recursively) sorted. So upon return drop pivot+1 and right. Second call to qsort will (recursively) sort left partition. Upon return whole array is sorted.

0 16 qsort r---
Call qsort with start and end addresses of data array.

Code without comments:
503 node 0 org
19 , 4 , 7 , 1 , 12 , 13 , 0 , 11 , 3 , 9 ,
17 , 16 , 10 , 6 , 2 , 13 , 5 ,
setab over over b! a! ;
exch setab @ @b ! !b ;
1+ 1 . + ; 1- -1 . + ;
part exch drop
   over over - . + 1+ - push
   setab over
   begin
      push a
      @ @b - . +
      -if drop pop exch 1+ dup push
      then drop
      1+ a! dup b! pop
   next exch ;
qsort over - over . +
   -if drop ;
   then 2/ push over pop . +
   part
   a! push a 1- a 1+ pop
   qsort drop drop qsort ;
0 16 qsort r---

I wonder what a multi-core version of this algorithm would look like? And it's also pretty obvious that a recursive algorithm is not feasible with such a limited data stack. Might try a comb sort next.