DSTB1 exxos first tests & experimental firmware.

Post by **exxos** » Sun May 08, 2022 4:31 pm

ijor wrote: ↑Sun May 08, 2022 4:10 pm Off the top of my head between ST and STE, not sure if it affects your case, but are you aware about the different DTACK behavior when accessing the PSG? I remember this affected an Exxos accelerator, but I think it was a much faster one?

I wrote BW a long PM about all that. Thing is, his booster, like my previous gen boosters, clock-switches back to 8MHz on ST side bus access. So anything screwy going on, well, shouldn't matter.

The only likely reason would be clock switching problems where a PSG cycle happens and then an alt-ram cycle at 16MHz, when it *could* read DTACK from the previous bus cycle and trip up. I've not had that problem, but I've always used the system clocks, not external ones. While wakeup states can skew the CPU clocks up, at 16MHz it doesn't matter anyway. BUT 16MHz on an external clock is not going to be as simple.

Also I didn't post a lot of my tests. I changed BW's firmware not to switch to 16MHz, or complete an alt-ram cycle, while ST side DTACK was low. It made no odds.

But also, EMUTOS behaves differently. TOS generally locks up when the sound starts, giving the impression of a PSG fault. But on EMUTOS it locks up even before graphics are drawn. So it suggests not a sound fault, unless EMUTOS for some reason does more sound accesses than TOS. I don't know.

But it's all kinda irrelevant at this point, as for me it doesn't work at 8MHz anyway; in fact it's even more unstable. You would think things going at stock speeds would be more stable as timings don't need to be as tight. But I don't know how all the SDRAM timing works anyway, which is why I questioned it earlier.

I just get the impression the SDRAM isn't ready, either reading or writing, when the CPU terminates the cycle. But BW says that it should not be a problem. So I am at a loss.

In terms of the SEC booster, PSG DTACK was causing issues as it stayed low right up until about S2 on the next bus cycle. It was tripping up my logic, hence why there is a SND CS jumper on the H5 boards. But that was running 64MHz all the time so things will trip up. Clock switching stuff shouldn't.

ijor · Post by **ijor** » Sun May 08, 2022 8:16 pm

As I said, I'm not an expert on this Xilinx CPLD family, but I took a quick look and tried to perform a timing analysis. The fact that Xilinx never updated the software (the timing tools even in the last ISE release are decades old) is not very helpful.

In the first place, note that your design has several async features. Sometimes using async logic might help to fit a design, and it is OKish as long as you treat the CPLD just as a bigger PAL/GAL chip. But when you start considering the CPLD as a small FPGA, you implement complex state machines, and interface with a fast fully synchronous device such as SDRAM, then it is better to follow good synchronous design practices.

Meeting timing here is not easy. There are a few changes that might be worth considering to improve timing.

You are connecting the main clock (OSC) to a “regular” input. Use one of the global clock inputs instead. If you look at the specs, global clocks are much faster and have much less skew. In some other devices you can connect “regular” input pins to the global clock network, but not with this one. Maybe you can swap pins with “RAMCLK” or “CLKOUT” that waste global clock input pins while they are output only signals.

Btw, why are you passing RAMCLK through the CPLD at all? You are not doing anything with that signal internally, just passing it through from OSC to RAMCLK directly. If you were using an FPGA, this connection could be very useful. But in this case, this just produces a considerable (and seemingly unnecessary) delay. It is up to 15 ns (almost one full cycle at 66 MHz) according to the timing analyzer. Perhaps the delay is not critical by itself in this case. But a long delay will produce more jitter and more variations across different setups.

Code:

assign MA = READY ? SETUP_MA : MAIN_MA;
assign RAS = READY ? SETUP_CMD[2] : CMD[2];
…

Combinational I/O output is slow. Sometimes you can't help it, but I think you can avoid combinational I/O here. Register all SDRAM outputs. This will have a huge impact on the timing. Note that after all these changes, the DRAM timing will probably change dramatically. You might need to invert the clock signal to achieve a better signal alignment.

Code:

always @(posedge CLK8 or negedge RST)  begin
	if (RST == 1'b0) begin 
		COUNTER 	<= 'd0;
	end else begin 
		COUNTER <= COUNTER + 'd1;
	end
end

// indicate refresh needed and do initialisation
always @(posedge CLK or negedge RST)  begin

Any particular reason you are using two different clocks at the ram controller? Also note that this reset might corrupt ram on warm reset. It would be difficult to use a different reset though, and might not be worth the complication. But I think it is important to be aware that this might happen.

There are, however, a LOT of terms in the equations governing the SDRAM. I honestly just think this is one too many. The line probably isn't stable in time and the state machine gets out of whack. For you, it's FB that shits out, for me it's MiNT.
Try removing the altram_ext term from the ACCESS() line again, and I bet the problem goes away.

This is quite difficult to analyze because you are mixing clocks here. But the number of product terms has a relatively small impact on the overall delay. The clock skew, the external delay, and the flip flop delay itself (clocked again with a non-global clock) are much more significant. I think that the clock skew might be so huge here that you should probably consider the two clock domains as being asynchronous. The original oscillator clock is first divided, then it goes through the clock mux, and lastly, because you don’t use the CPU clock itself, you “suffer” the clock to output delay produced by the CPU when asserting AS. Just the last component alone, according to the 68K datasheet, could be up to 30ns. That’s almost two cycles at 66 MHz!

… but what are the key signals on the third state of nine of the DRAM controller? No idea.

Not sure if this was just a rhetorical question … From the point of view of the SDRAM chip, all signals are equally critical. Well, except the clock itself, which is obviously special. They are all synchronous and they all have the same specs. Depending on the design of your controller, some signals might have a less critical behavior though. Take the DQM signals, for example. For a full-fledged controller that writes multiple data in subsequent cycles, these signals are still very critical; they must meet timing rigorously. But here you never write multiple data back-to-back. You can set up these signals earlier than they are actually needed and then they become non-critical.

Badwolf · Post by **Badwolf** » Mon May 09, 2022 12:43 pm

Hi Ijor,

Thanks for taking the time to respond. Long one coming up:-

ijor wrote: ↑Sun May 08, 2022 4:10 pm No idea if this is the case here, but it might very well be a timing problem. When you test it and it works, it just means it works on your own setup under specific conditions. But conditions might be different for others. This includes of course differences in voltage and temperature. But also each chip has a different performance. Let alone if somebody decides to use a compatible SDRAM chip from a different manufacturer.

All true and understandable, which is why we test. In this case I built the board and know it works here, but there's a different motherboard, CPU and PSU at the other end.

Off the top of my head between ST and STE, not sure if it affects your case, but are you aware about the different DTACK behavior when accessing the PSG? I remember this affected an Exxos accelerator, but I think it was a much faster one?

I was half aware of it but don't know the details -- like you say, Exxos has mentioned issues in the past and I saw one of Stephen's TF videos identified it too. Unlike Exxos, I think that clock switching doesn't guarantee it's avoided, so this is on the list once the SDRAM is sorted out (although it's possible it may be connected; do you know the details of the differences? Would it affect anything other than sound and serial? Ta.)

You are connecting the main clock (OSC) to a “regular” input. Use one of the global clock inputs instead. If you look at the specs, global clocks are much faster and have much less skew. In some other devices you can connect “regular” input pins to the global clock network, but not with this one. Maybe you can swap pins with “RAMCLK” or “CLKOUT” that waste global clock input pins while they are output only signals.

This is obviously a case where I've misunderstood the usefulness of the global clocks. I thought the output clocks would be more important as I take my cues from these derived clocks, rather than from the input one. It lets me set timing specifications to the clocks that the rest of the system sees (not that I have in this design as, like I said, I don't really understand it).

So should I take it as a rule of thumb that input clocks are better going through GCKs than outputs?

Btw, why are you passing RAMCLK through the CPLD at all? You are not doing anything with that signal internally, just passing it through from OSC to RAMCLK directly.

In short, flexibility. In DFB1, for example, the RAMCLK slows down with the CPU clock to ensure synchronisation during switching (it's a synchronous [STERM-based] RAM cycle there). I rewrote the SDRAM controller here to 'emulate SRAM' so the asynchronous DTACK-based ram cycle shouldn't care about sync. But deriving the RAMCLK lets me change that behaviour if it turns out to be a problem.

Code:
assign MA = READY ? SETUP_MA : MAIN_MA;
assign RAS = READY ? SETUP_CMD[2] : CMD[2];
…
Combinational I/O output is slow. Sometimes you can't help it, but I think you can avoid combinational I/O here. Register all SDRAM outputs. This will have a huge impact on the timing. Note that after all these changes, the DRAM timing will probably change dramatically. You might need to invert the clock signal to achieve a better signal alignment.

This sounds great, but I confess to not knowing exactly how to do what you mean. Is the latter just a case of declaring output reg RAS, for example?

The multiplexed output is to avoid extra logic within the state machine (which was my limiting factor). What would you suggest instead? Simply setting the new 'RAS' register in an always block with the same multiplexer, or something else?

Any particular reason you are using two different clocks at the ram controller?

Mostly to need fewer bits in the counter which is only used to initialise the RAM in the SETUP phase. Secondarily I know the CPU8 line is fixed at 8MHz so can work out the startup delay without needing to worry about user changes to the oscillator.

Also note that this reset might corrupt ram on warm reset. It would be difficult to use a different reset though, and might not be worth the complication. But I think it is important to be aware that this might happen.

Noted and accepted. I make no effort to preserve the contents of RAM if the user hits reset.

This is quite difficult to analyze because you are mixing clocks here. But the number of product terms has a, relatively, small impact in the overall delay. The clock skew, the external delay, and the flip flop delay (clocked again with a non-global clock) itself are much more significant.

My flipflops aren't clocked to CLKOSC but to the derived output lines which are routed to GCKs. Are GCKs only of any use on inputs?

I think that the clock skew might be so huge here, that you should probably consider the two clock domains as being asynchronous. The original oscillator clock is first divided, then it goes through the clock mux, and lastly, because you don’t use the CPU clock itself, you “suffer” the clock to output delay produced by the CPU when asserting AS. Just the last component alone according to the 68K datasheet could be up to 30ns. That’s almost two cycles at 66 MHz!

I do treat them as asynchronous: there's nothing synchronous in the bus cycle logic at all, only within the SDRAM module are things synchronous (and that's to RAMCLK). The SDRAM controller will keep driving the same output until it's (asynchronously) acknowledged and will not assert DTACK until it's driving. This is not the most efficient way to drive DTACK, but I viewed it as safe, which is why I thought it should work independently of CPU speed (although you will not achieve max throughput beyond about 16-25MHz -- trade off).

… but what are the key signals on the third state of nine of the DRAM controller? No idea.
Not sure if this was just a rhetorical question … From the point of view of the SDRAM chip, all signals are equally critical.

Yes, this was both rhetorical and hypothetical, but it was meant to illustrate that I wouldn't know what input to output delay was critical to which behaviour.

I understand that the SDRAM controls are all synchronous (the clue's in the name!) but, for example, if my state machine is occasionally missing a step because there's one too many multiplexes within one block for the wires to be stable in time for the next clock, I'm afraid I simply don't know how or what I should be examining to establish that. So I iterate & test.

BTW, this SDRAM controller is a new build because it has different requirements to the DFB1 one, but the overall approach is based on snippets I've learned from DFB1, from how Stephen does his and from examples on the web. Of course it's those little details -- knowing quite the effect of GCK on inputs versus outputs, how much you can fit into a state machine safely, what to look for in the timing report, what & how to TIMESPEC where I don't have the background.

Cheers,

BW.

ijor · Post by **ijor** » Tue May 10, 2022 2:46 am

Badwolf wrote: ↑Mon May 09, 2022 12:43 pm Long one coming up:

Long reply here as well

(Re: about the different DTACK behavior when accessing the PSG?)

do you know the details of the differences? Would it affect anything other than sound and serial? Ta.)

GLUE delays DTACK a couple of cycles when accessing the PSG. This delay affects both assertion and deassertion of DTACK. That means that GLUE keeps asserting DTACK when the next bus cycle has already started. This is not a problem for a CPU running at normal speed. By the time the CPU checks DTACK on the next cycle, GLUE has already deasserted the old DTACK. But if the CPU is fast enough, then it might end the next bus cycle before the target device is ready. As Exxos said, it affected his older accelerator at 64 MHz. Not sure if this is your problem at "just" 16MHz.

It doesn't affect the access to the PSG chip itself. It affects the next bus cycle. This was fixed in the STE combo chip and DTACK is deasserted immediately together with AS.

So should I take it as a rule of thumb that input clocks are better going through GCKs than outputs?

It depends on the specific device, or at least on the specific family. For this CPLD family, the XC9500XL, yes, global clocks are only available at pin inputs.

In short, flexibility. In DFB1, for example, the RAMCLK slows down with the CPU clock to ensure synchronisation during switching (it's a synchronous [STERM-based] RAM cycle there).

I understand the idea, but IMHO, here you are paying too high a cost. This delays RAMCLK by almost a full 66 MHz cycle. The worst part of this is that you have no control over this delay. It might be one cycle or it might be half a cycle. It is very difficult to meet timing with such uncertainty.

The ideal solution for this is to use a device with its own PLL (like a MAX-10). You can then do almost what you want with the clock and still keep the edges aligned. I do realize this might be too much for such a low cost project.

Combinational I/O output is slow. Sometimes you can't help it but I think you can avoid combinational I/O here.
This sounds great, but I confess to not knowing exactly how to do what you mean. Is the latter just a case of declaring output reg RAS, for example?

No. Registering in this context means that the signal should be the direct output of a flip flop, with no combinational logic after. Registered signals are faster, have less skew, and do not glitch (the latter is not very important in this case, but it is very important in other cases).

The multiplexed output is to avoid extra logic within the state machine (which was my limiting factor). What would you suggest instead? Simply setting the new 'RAS' register in an always block with the same multiplexer, or something else?

I doubt the size of the state machine is your limiting factor. Combine both synchronous blocks with something like this:

Code:

		if (READY == 1'b1) begin
			...
			CMD <= CMD_PRECHARGE;
			...
		end else begin		// when READY == 1'b0
			...
			CMD <= CMD_NOP;
			...
		end

assign RAS = CMD[2];	// RAS is now the direct output of the CMD flip-flop
// Same for all other SDRAM signals
...

Any particular reason you are using two different clocks at the ram controller?
Mostly to need fewer bits in the counter which is only used to initialise the RAM in the SETUP phase ...

This is not really a very good idea. You are transferring data from two unrelated clock domains without any synchronization. You can't freely mix two clocks like that. It is perfectly possible that you would read the wrong value at the target.
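One common way to keep a short counter without mixing clock domains is to sample CLK8 as ordinary data in the fast domain and use its rising edge as a count enable. The following is only a minimal sketch, assuming the fast clock runs at well over twice the CLK8 frequency. The module name, `WIDTH`, `clk8_tick` and `INIT_DONE` are illustrative and not taken from the DSTB1 source.

```verilog
// Sketch only: all names below are hypothetical, not from the DSTB1 firmware.
module init_timer #(
    parameter WIDTH = 11            // assumed width; size it for the startup delay your SDRAM needs
) (
    input  wire CLK,                // fast SDRAM controller clock (ideally on a global clock pin)
    input  wire CLK8,               // 8 MHz system clock, sampled as data, never used as a clock here
    input  wire RST,                // active-low reset
    output reg  INIT_DONE           // set once the startup delay has elapsed
);
    reg [2:0]       clk8_sync;      // two synchronizer stages plus one stage for edge detection
    reg [WIDTH-1:0] counter;

    // One CLK-wide pulse for every rising edge seen on the synchronized CLK8.
    wire clk8_tick = clk8_sync[1] & ~clk8_sync[2];

    always @(posedge CLK or negedge RST) begin
        if (RST == 1'b0) begin
            clk8_sync <= 3'b000;
            counter   <= {WIDTH{1'b0}};
            INIT_DONE <= 1'b0;
        end else begin
            clk8_sync <= {clk8_sync[1:0], CLK8};
            if (clk8_tick && !INIT_DONE) begin
                counter <= counter + 1'b1;
                if (&counter)           // counter was all ones: delay complete
                    INIT_DONE <= 1'b1;
            end
        end
    end
endmodule
```

The counter still advances at the 8 MHz rate, so it stays as narrow as before, but every flip-flop is now clocked by the same clock. The cost is three extra flip-flops, which may or may not fit in the CPLD.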

Code:

always @( posedge CLK )
	state <= nextstate;

always @( negedge CLK ) begin
	...
	case(state)
		nextstate <= STATE_REFRESH_NOP1;

This is bad. You are writing on one edge of the clock and reading on the other edge. You are effectively halving the cycle time. In other words, these signals would need to be as fast as if the clock had double the frequency (133 MHz). And unsurprisingly, with only 7.5ns from edge to edge, you don't meet timing here.

Furthermore, you don't need this at all. Eliminate nextstate altogether, and just write directly to the "state" signal. Or if you prefer, set nextstate separately in a combinational process (this is just a matter of style).
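As a rough illustration of writing directly to "state" on a single clock edge, here is a small self-contained sketch. The state names other than STATE_REFRESH_NOP1, the encodings and the `REFRESH_REQ` input are assumptions made for the example, not the actual DSTB1 controller.

```verilog
// Sketch only: states, encodings and REFRESH_REQ are hypothetical.
module refresh_fsm (
    input  wire       CLK,          // single clock, rising edge only
    input  wire       RST,          // active-low reset
    input  wire       REFRESH_REQ,  // assumed request input
    output reg  [1:0] state
);
    localparam STATE_IDLE         = 2'd0;
    localparam STATE_REFRESH      = 2'd1;
    localparam STATE_REFRESH_NOP1 = 2'd2;
    localparam STATE_REFRESH_NOP2 = 2'd3;

    // No nextstate register and no negedge block: each transition has a
    // full clock period to settle before the next rising edge.
    always @(posedge CLK or negedge RST) begin
        if (RST == 1'b0) begin
            state <= STATE_IDLE;
        end else begin
            case (state)
                STATE_IDLE:         if (REFRESH_REQ) state <= STATE_REFRESH;
                STATE_REFRESH:      state <= STATE_REFRESH_NOP1;
                STATE_REFRESH_NOP1: state <= STATE_REFRESH_NOP2;
                STATE_REFRESH_NOP2: state <= STATE_IDLE;
                default:            state <= STATE_IDLE;
            endcase
        end
    end
endmodule
```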

My flipflops aren't clocked to CLKOSC but to the derived output lines which are routed to GCKs.

It doesn't work like that. Your flip-flops are in fact clocked by CLKOSC. You can't use an actual output as a clock, or as any kind of input for that matter. The actual clock is whatever is driving that output. The fact that you connected that output to a GCK pin is not relevant.

That's another benefit of performing a timing analysis. The results would hint at such issues. In this case you can see how much faster the signals clocked by CLK8 are (the only one for which you are actually using a global clock). It is also very useful to use the Technology Schematics Viewer (in the Tools menu) to see exactly what the compiler synthesized from your code.

I do treat them as asynchronous, there's nothing synchronous in the bus cycle logic at all, only within the SDRAM module are things synchronous (and that's to RAMCLK).

If you consider them async, then you had better synchronize "altram_access_int". It is not a good idea to feed an unsynchronized signal to a state machine. You should also check for any potential hazards.
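A conventional way to synchronize a single-bit signal such as "altram_access_int" is a two flip-flop synchronizer clocked by the destination clock. The sketch below assumes an active-low reset like the one used elsewhere in the thread; the module and port names are made up for illustration.

```verilog
// Sketch only: generic two flip-flop synchronizer, names are hypothetical.
module sync_2ff (
    input  wire CLK,        // destination (state machine) clock
    input  wire RST,        // active-low reset
    input  wire async_in,   // signal from the other clock domain
    output wire sync_out    // version that is safe to use in the CLK domain
);
    reg [1:0] sync;

    always @(posedge CLK or negedge RST) begin
        if (RST == 1'b0)
            sync <= 2'b00;
        else
            sync <= {sync[0], async_in};  // first stage may go metastable, second stage filters it
    end

    assign sync_out = sync[1];
endmodule
```

It could be instantiated as `sync_2ff u_sync (.CLK(CLK), .RST(RST), .async_in(altram_access_int), .sync_out(altram_access_sync));`, where `altram_access_sync` is again a hypothetical name. The synchronized signal arrives up to two clock periods late, which has to be accounted for in the DTACK timing.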

Btw, are you aware that AS is kept asserted all the time on a RMW bus cycle (when using TAS)? Did you check this doesn't break your logic?

I understand that the SDRAM controls are all synchronous (the clue's in the name!) but, for example, if my state machine is occasionally missing a step because there's one too many multiplexes within one block for the wires to be stable in time for the next clock, I'm afraid I simply don't know how or what I should be examining to establish that. So I iterate & test.

In the first place, it is not only the SDRAM that is synchronous. Your design is also synchronous. Even when you are using some async logic, it still has many flip flops. If you violate timing specs for any flip flop, the results are unpredictable. This is as true for the internal flip flops as for the ones on the SDRAM chip.

In the second place, I think you are too concerned with the number of terms, what is known as the combinational path. This is a typical problem on other devices such as FPGAs. But the architecture of this CPLD is very different. As I said in my previous message, the number of terms is usually not as significant as other internal delays. Certainly, using a non-global clock and using both edges of the same clock are much more significant.

So what do you do to detect whether your state machine meets timing? Well, as I said from the beginning, you constrain the design and perform a timing analysis. The timing analyzer will tell you. Constraining the external interface is not easy (and with such a delay on the RAMCLK output it probably won't be reliable anyway). But the state machine itself (the internal timing) is, from this point of view, rather simple.

Badwolf · Post by **Badwolf** » Tue May 10, 2022 11:15 am

ijor wrote: ↑Tue May 10, 2022 2:46 am
Badwolf wrote: ↑Mon May 09, 2022 12:43 pm Long one coming up:
Long reply here as well

Ijor,

Thanks very much for this. It's cleared up a few misconceptions I've obviously had and picked up on a couple of things that I'd left over from earlier experiments by mistake (the state/nextstate one being on different edges, for example).

Plenty of scope for improvement.

In terms of the CLK8 handling the initialisation, I'll take your word for it. I don't see the issue, but then like you point out, I'm multiplexing after the event. If I bring that whole logic back into the synchronous section then it makes more sense to have a single clock.

I have never doubted the utility of timing analysis, by the way. I simply don't understand how to interpret it and the TIMESPEC declaration in the UCF file has utterly defeated me time and time again!

When I work on the refactoring to address some of these issues, I'll have a look at running a bodge wire from the oscillator's output resistor to the RAMCLK line directly and treating the RAMCLK line as the main OSC input (it'll be on a GCK then).

Thanks again,

BW.

Steve · Post by **Steve** » Tue May 10, 2022 12:23 pm

I don't understand any of this, but I feel the conversation between @ijor and @Badwolf is the kind of dialogue that could effectively create world peace.

Post by **exxos** » Tue May 10, 2022 12:58 pm

Steve wrote: ↑Tue May 10, 2022 12:23 pm I don't understand any of this, but I feel the conversation between @ijor and @Badwolf is the kind of dialogue that could effectively create world peace.

Quick translated version:

Ijor is saying BW's coding practices are not "optimal" for what he is trying to do.

b_squared · Post by **b_squared** » Tue May 10, 2022 4:41 pm

@ijor, are you going off the latest code in Github? I can jump in to help look at it as well, did RTL design professionally in a past life.

ijor · Post by **ijor** » Tue May 10, 2022 4:44 pm

Badwolf wrote: ↑Tue May 10, 2022 11:15 am I have never doubted the utility of timing analysis, by the way. I simply don't understand how to interpret it and the TIMESPEC declaration in the UCF file has utterly defeated me time and time again!

You don't need to enter the TIMESPEC constraint manually. Use the Constraints Editor in the Tools menu. Assuming you have already run a compilation, the constraints editor will automatically detect unconstrained clocks. Click on the main CLKOSC (forget about the other clocks for the time being), enter the time period and accept all the default parameters. That's all it takes (for a basic clock TIMESPEC constraint).

Btw, are you aware that AS is kept asserted all the time on a RMW bus cycle (when using TAS)? Did you check this doesn't break your logic?

Did you check that?

Badwolf · Post by **Badwolf** » Tue May 10, 2022 4:56 pm

ijor wrote: ↑Tue May 10, 2022 4:44 pm You don't need to enter the TIMESPEC constraint manually. Use the Constraints Editor in the Tools menu. Assuming you have already run a compilation, the constraints editor will automatically detect unconstrained clocks. Click on the main CLKOSC (forget about the other clocks for the time being), enter the time period and accept all the default parameters. That's all it takes (for a basic clock TIMESPEC constraint).

That I can do (and have done previously). Is that useful by itself? [I had always assumed since I wasn't using that for anything then it wasn't, but you've disabused me of what I'm going to call the 'derived clock fallacy', so I presume it is.]

The 'Performance Summary' (very highest level summary) shows:

Code:

Min. Clock Period 	14.000 ns.
Max. Clock Frequency (fSYSTEM) 	71.429 MHz.
Limited by Clock Pulse Width for CLKOSC_4.Q
Clock to Setup (tCYC) 	11.000 ns.
Pad to Pad Delay (tPD) 	18.700 ns.
Setup to Clock at the Pad (tSU) 	6.500 ns.
Clock Pad to Output Pad Delay (tCO) 	38.200 ns.

Btw, are you aware that AS is kept asserted all the time on a RMW bus cycle (when using TAS)? Did you check this doesn't break your logic?
Did you check that?

Yes, I have and I think it is a gap in the logic: ACCESS() depends on AS but not U/LDS. I'm not sure what would happen on a real chip but my guess is the write would be missed.

I think that's a genuine error, but I'm guessing it's a rare enough instruction that I've not seen it with my personal use.

I'll certainly close that loophole when I have a go at rewriting the SDRAM controller taking your advice on board.

I also need to check if that behaviour's the same in the 030, as I suspect DFB might be the same.
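The gap described above (ACCESS() depending on AS but not on UDS/LDS) could, in principle, be closed by qualifying the access with the data strobes, since on a 68000 read-modify-write cycle AS stays asserted while UDS/LDS are released between the read and the write. This is only a sketch under that assumption: `ALTRAM_SEL` and the module name are hypothetical, and the real ACCESS() equation is not reproduced here.

```verilog
// Sketch only: ALTRAM_SEL and the module name are hypothetical.
module access_qualifier (
    input  wire AS,          // address strobe, active low
    input  wire UDS,         // upper data strobe, active low
    input  wire LDS,         // lower data strobe, active low
    input  wire ALTRAM_SEL,  // assumed: high when the address decodes to alt-RAM
    output wire ACCESS       // high only while a data strobe is actually asserted
);
    // At least one data strobe asserted.
    wire data_strobe = ~UDS | ~LDS;

    // During TAS, ACCESS drops between the read and write halves because
    // UDS/LDS are deasserted there, even though AS remains low throughout.
    assign ACCESS = ALTRAM_SEL & ~AS & data_strobe;
endmodule
```

One trade-off is that on write cycles the 68000 asserts UDS/LDS later than AS, so a strobe-qualified ACCESS starts later than an AS-only one. The result is still an asynchronous signal and would need synchronizing before it feeds the SDRAM state machine.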

Ta,

BW.
