# What's Your Favorite Processor on an FPGA?

#### rickman

Jan 1, 1970
0
I have been working on designs of processors for FPGAs for quite a
while. I have looked at the uBlaze, the picoBlaze, the NIOS, two from
Lattice and any number of open source processors. Many of the open
source designs were stack processors since they tend to be small and
efficient in an FPGA. J1 is one I had pretty much missed until lately.
It is fast and small and looks like it wasn't too hard to design
(although looks may be deceptive). I'm impressed. There is also the b16
from Bernd Paysan, the uCore, the ZPU and many others.

Lately I have been looking at a hybrid approach which combines features
of addressing registers in order to access parameters of a stack CPU.
It looks interesting.
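
For readers unfamiliar with the idea, here is a minimal sketch of what stack-relative parameter access can look like on a stack machine. It is not rickman's design, nor the J1 or any other core named above; the opcode names (`lit`, `add`, `mul`, `pick`) and the `run` function are made up for illustration. The `pick` opcode plays the role of the register-like addressing: it copies a parameter from a fixed depth instead of shuffling the stack to reach it.

```python
def run(program, stack=None):
    """Execute (opcode, operand) pairs on a data stack and return the stack."""
    stack = list(stack or [])
    for op, arg in program:
        if op == "lit":      # push a literal value
            stack.append(arg)
        elif op == "add":    # pop two items, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":    # pop two items, push their product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pick":   # hypothetical hybrid op: copy the item `arg` slots below the top
            stack.append(stack[-1 - arg])
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack


# Parameters a=3 and b=4 are already on the stack (b on top).
# The program computes a*a + b without any swap/rot shuffling.
program = [("pick", 1), ("pick", 2), ("mul", None), ("add", None)]
result = run(program, [3, 4])
```

After the run, the top of `result` holds a*a + b, and the untouched parameter `a` is still underneath it.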

Anyone else here doing processor designs on FPGAs?

#### Bill Sloman

Jan 1, 1970
0
Sounds like something where you'd get more responses on comp.arch.fpga.

Are you cross-posting?

#### rickman

Jan 1, 1970
0
My guys have been ragging me for years to do designs that have soft-core CPUs in
FPGAs, but I've been able to convince them (well, I am the boss) that they
haven't made sense so far. They use up too many FPGA resources to make a
mediocre, hard to use CPU. So we've been using separate ARM processors, and
using a bunch of pins to get the CPU bus into the FPGA, usually with an async
static-ram sort of interface.

There's supposed to be a Cyclone coming soon, with dual-hard-core ARM processors
and enough dedicated program RAM to run useful apps. When that's real, we may go
that way. That will save pins and speed up the CPU-to-FPGA logic handshake.

If the programs get too big for the on-chip sram, I guess the fix would be
external DRAM with CPU cache. There goes the pin savings. At that point, an
external ARM starts to look good again.

The choice of an internal vs. an external CPU is a systems design
decision. If you need so much memory that external memory is warranted,
then I guess an external CPU is warranted. But that all depends on your
app. Are you running an OS? If so, why?

The sort of stuff I typically do doesn't need a USB or Ethernet
interface, both great reasons to use an ARM... free, working software
that comes with an OS like Linux. (by free I mean you don't have to
spend all that time writing or debugging a TCP/IP stack, etc)

But there are times when an internal CPU works even for high level
interfaces. In fact, the J1 was written because they needed a processor
to stream video over Ethernet and the uBlaze wasn't so great at it.

I get the impression your projects are about other things than the
FPGA/CPU you use and cost/size really aren't so important. Then you
have less reason to squeeze on size, power, unit costs, but rather
minimize development cost. If so, that only makes sense.

My next project will be similar in hardware requirements to a digital
watch, but with more processing...

#### rickman

Jan 1, 1970
0
FPGA ram is expensive compared to the SRAM or flash that comes on a small ARM,
like an LPC1754. Something serious, like an LPC3250, has stuff like hardware
vector floating point and runs 32-bit instructions at 260 MHz. Both the ARMs
have uarts, timers, ADCs, DACs, and Ethernet, for $4 and $7 respectively.

That is not a useful way to look at RAM unless you are talking about
buying a larger chip than you need otherwise just to get more RAM. That
is like saying the routing in an FPGA is "expensive" compared to the
PCB. It is there as part of the device, use it or it goes to waste.

If you need Ethernet, then Ethernet is useful. But adding Ethernet to
an FPGA is no big deal. Likewise for nearly any peripheral.

No point in discussing this very much. Every system has its own
requirements. If external ARMs are what works for you, great!

We generally run bare-metal, a central state machine and some ISR stuff. I've
written three RTOSs in the past but haven't really needed one lately.

What do you do for the networking code? If you write your own, then you
are doing a lot of work for naught typically, unless you have special
requirements.

Yeah, we use the GCC compilers. Stuff like Ethernet and USB stacks are available
and work without much hassle. I don't know what the tool chains are like for the
soft cores.

So you are using networking code, but no OS?

The soft cores I work with don't bother with that sort of stuff. The
apps are much smaller and don't need that level of complexity. In fact,
that is what they are all about, getting rid of unneeded complexity.

We do a fair amount of "computing", stuff like signal filtering, calibrations
with flash cal tables, serial and Ethernet communications, sometimes driving
leds and lcds. There have been a minority of apps simple enough to use a
microblaze, and I didn't think that acquiring/learning/archiving another whole
tool chain was worth it for those few apps, what with an LPC1754 costing $4.

Ethernet comms can be a hunk of code, but the rest of what you describe
is pretty simple stuff. I'm not sure there is even a need for a
processor. Lots of designers are just so used to doing everything in
software they think it is simple.

Actually, I think everything you listed above is simple enough for a
uBlaze. What is the issue with that?

I find HDL to be the "simple" way to do stuff like I/O and serial comms,
even signal processing. In fact, my bread and butter is a product with
signal processing in an FPGA, not because of speed, it is just an audio
app. But the FPGA *had* to be there. An MCU would just be a waste of
board space which this has very little of.

Sometimes you can just do the computing "in hardware" in the FPGA and not even
need a procedural language.

So the use case gets even smaller.

I am looking forward to having a serious ARM or two (or, say, 16) inside an
FPGA. With enough CPUs, you don't need an RTOS.

Xilinx has that now you know. What do they call it, Z-something? Zync
maybe?

How about 144 processors running at 100's of MIPS each? Enough
processing power that you can devote one to a serial port, one to an SPI
port, one to flash a couple of LEDs and still have 140 left over. Check
out the GreenArrays GA144. Around $14 the last time I asked. You won't
like the development system though. It is the processor equivalent of
an FPGA. I call it a FPPA, Field Programmable Processor Array. It can
be *very* low power too if you let the nodes idle when they aren't doing
anything.

#### [email protected]

Jan 1, 1970
0
Xilinx has that now you know. What do they call it, Z-something? Zync maybe?

ZYNQ. There is a rather low-cost eval board, named
Zedboard (www.zedboard.org, $395) which comes with
Linux pre-installed on an SD card. The ZYNQ chip
onboard contains a hard dual-core Cortex-A9 and
~1M gates worth of 7th generation logic.

Regards,
Mikko

#### rickman

Jan 1, 1970
0
Soft core is a fun thing to do, but otherwise I see no use.
Except for a very few special applications, a standalone processor is better
than an FPGA soft core in every point, especially the price.

Everyone is entitled to their opinion, but this is *far* from fact. The
CPUs in my designs have so far been *free* in recurring price. They fit
in a small part of the lowest priced device I can find.

Most people think of large, complex code that requires lots of RAM and
big, fast external CPUs. I think in terms of small, internal processors
that run fast in a very small code space. So they fit inside an FPGA
very easily, likely not much bigger than the state machines John talks
about.

BTW, have you looked at any of the soft cores? The J1 is pretty amazing
in terms of just basic simplicity, fast too at 100 MHz. They talk about
the source being just 200 lines of Verilog. I don't know how many
LUTs the design uses, but from the block diagram I expect it is not very
big. I'm not sure I can improve on it in any significant way.

#### [email protected]

Jan 1, 1970
0
The annoying thing is the CPU-to-FPGA interface. It takes a lot of FPGA pins and
it tends to be async and slow. It would be great to have an industry-standard
LVDS-type fast serial interface, with hooks like shared memory, but transparent
and easy to use.

Something like ARM internal to an FPGA could have a synchronous, maybe shared
memory, interface into one of those SOPC type virtual bus structures without
wasting FPGA pins.

Xilinx Zynq, ARM Cortex-A9 with an FPGA on the side

-Lasse

#### [email protected]

Jan 1, 1970
0
We gave up on Xilinx a few years ago: great silicon, horrendous software tools.
Altera is somewhat less horrendous.

At one point it did crash a lot, but I haven't had many problems with
it for the past few years.

-Lasse

#### Allan Herriman

Jan 1, 1970
0
The annoying thing is the CPU-to-FPGA interface. It takes a lot of FPGA
pins and it tends to be async and slow. It would be great to have an
industry-standard LVDS-type fast serial interface, with hooks like
shared memory, but transparent and easy to use.

You've just described PCI Express.

- Industry standard fast serial interface.
- AC-coupled CML (rather than LVDS, but still differential).
- scalable bandwidth:
  - 2.5, 5.0, 8.0 Gbps per lane.
  - 1, 2, 4, 8 or 16 lanes.
- allows single access as well as bursts.
- multi-master (allows DMA).
- Fabric can be point-to-point (e.g. CPU-FPGA) or can use switches for
larger networks.
- in-band interrupts (saves pins).
- Peripherals (typically) just appear as chunks of memory in the CPU
address space.
- Widely supported by operating systems.
- Supports hot plug.
- Many FPGAs have hard cores for PCIe.
- Supported by ARM SoCs (but not the very cheapest ones).
- compatible with loads of off the shelf chips and cards.
- Easy to use (although that might be an "eye of the beholder" type of
thing).

I wouldn't recommend PCIe for the lowest cost or lowest power products,
but it's great for the stuff that I do.

Regards,
Allan

#### Allan Herriman

Jan 1, 1970
0
No. PCIe is insanely complex and has horrible latency. It takes
something like 2 microseconds to do an 8-bit read over gen1 4-lane PCIe.
It was designed for throughput, not latency.

I agree about it being designed for throughput, not latency. However,
with a fairly simple design, we can do 32 bit non-bursting reads or
writes in about 350ns over a single lane of gen 1 through 1 layer of
switching. I suspect there's some problem with your implementation
(unless your 2 microsecond figure was just hyperbole).

We've done three PCIe projects so far, and it's the opposite of
"transparent and easy to use." The PCIe spec reads like the tax code and
Obamacare combined.

I found the spec clear. It's rather large though, and a text book serves
as more friendly introduction to the subject than the spec itself.

One of my co-workers was confused by the way addresses come most
significant octet first, whilst the data come least significant octet
first. It makes sense on a little endian machine, once you get over the
WTF.
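
A small sketch of the byte ordering described above, using made-up values. It only shows the order in which the octets of one 32-bit address and one 32-bit data word would appear, and it ignores everything else in a real PCIe packet (header fields, byte enables, CRC). The function name `wire_order` is hypothetical.

```python
def wire_order(address, data):
    """Return (address_octets, data_octets) in the order described above.

    The address goes most significant octet first (big endian), while the
    data word goes least significant octet first (little endian).
    """
    addr_octets = address.to_bytes(4, "big")
    data_octets = data.to_bytes(4, "little")
    return addr_octets, data_octets


# Illustrative values only.
addr_octets, data_octets = wire_order(0x12345678, 0xAABBCCDD)
print(addr_octets.hex())  # 12345678
print(data_octets.hex())  # ddccbbaa
```

On a little-endian CPU the data octets are already in memory order, which is why the arrangement makes sense once the initial surprise wears off.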

Hot plug is the only thing that gives us headaches. PCIe Hot plug is
needed when reconfiguring the FPGA while the system is running.
OS support for hot plug is patchy.
Partial FPGA reconfiguration is one workaround (leaving the PCIe up while
reconfiguring the rest of the FPGA), although I haven't tried that in any
production design yet.

Regards,
Allan

#### Allan Herriman

Jan 1, 1970
0
Writes are relatively fast, ballpark 350 ns gen1/4lane. Reads are slow,
around 2 us. That's from an x86 CPU into the PCIe hard core of an Altera
FPGA, cabled PCIe. A read requires two serial packets so is over twice
the time of a write.

I thought it was faster than that. If I remember, I'll measure some in
the lab tomorrow.

BTW, the write requires two packets as well.

We are still trying to get hot plug to work, both Linux and Windows.
HELP!

I don't know anything about hot plug support on Windows. On Linux,
however, there are two ways to do it:

- True hot plug. You need to use a switch (or root complex) that has
hardware support for the hot plug signals (particularly "Presence Detect"
that indicates a card is plugged in). The switch turns these into
special messages that get sent back to the RC, and the OS should honour
these and do the right thing. This should work on Windows too, as it's
part of the standard.

- Fake hot plug. With the Linux "fakephp" driver you can fake the hot
plug messages if you don't have hardware support for them. This isn't
supported in all kernel versions though. Read more here:
http://scaryreasoner.wordpress.com/2012/01/26/messing-around-with-linux-pci-hotplug/

In both cases there can be address space fragmentation that can stop the
system from working. By that I mean that the OS can't predict what will
be plugged in, so it can't know to reserve a contiguous chunk of address
space for your FPGA. The OS may do something stupid like put your
soundcard right in the middle of the space you wanted. Grrr.

Recent versions of the Linux kernel allow you to specify rules regarding
address allocation to avoid the fragmentation problem, but I've never
used them and I'm not a kernel hacker, so I don't know anything about
that.

Regards,
Allan

#### [email protected]

Jan 1, 1970
0
Writes are relatively fast, ballpark 350 ns gen1/4lane. Reads are slow, around 2
us. That's from an x86 CPU into the PCIe hard core of an Altera FPGA, cabled
PCIe. A read requires two serial packets so is over twice the time of a write.

A random read or write from an embedded CPU, to, say, a DPM in an FPGA, really
should take tens of nanoseconds. We do parallel ARM-FPGA transfers with a klunky
async parallel interface in 100 ns or so, but it takes a lot of pins.

From an x86 (not that we'd ever use an Intel chip in an embedded app) we haven't
found any way to move more than 32 bits in a non-DMA PCIe read/write, even on a
64-bit CPU that has a few 128-bit MOVE opcodes.

Little-endian is evil, another legacy of Intel's clumsiness.

why is it any more or less evil than big endian?

-Lasse

#### Nico Coesel

Jan 1, 1970
0
Most entry level scopes consist of an FPGA running a soft processor.
The annoying thing is the CPU-to-FPGA interface. It takes a lot of FPGA pins and
it tends to be async and slow. It would be great to have an industry-standard
LVDS-type fast serial interface, with hooks like shared memory, but transparent
and easy to use.

You mean PCI express?

#### Allan Herriman

Jan 1, 1970
0
On Mon, 22 Apr 2013 09:27:04 -0700, John Larkin wrote:

[ snip pcie hot plug discussion ]
We're assuming that an application will crash if its memory-mapped
target region (in our case, the remapped VME bus) vanishes. What we
can't do so far under Linux is re-enumerate the PCI space and start
things back up without rebooting.

With fakephp, you should just need to rescan that slot. With proper hot
swap hardware support, it should just happen automatically. (As if
anything would go wrong with that!)
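
As a rough illustration of a remove-then-rescan cycle around an FPGA reconfiguration, here is a sketch that uses the generic Linux sysfs PCI files (`/sys/bus/pci/devices/<address>/remove` and `/sys/bus/pci/rescan`). This is an assumption on my part about one way to do it: it is not the fakephp slot interface itself, whether it is sufficient depends on the kernel and platform, and it needs root. The device address in the usage note is a placeholder.

```python
from pathlib import Path


def remove_device(bdf, sysfs_root="/sys"):
    """Ask the kernel to detach the PCI device at address `bdf`."""
    Path(sysfs_root, "bus/pci/devices", bdf, "remove").write_text("1")


def rescan_bus(sysfs_root="/sys"):
    """Ask the kernel to re-enumerate the PCI bus and bind drivers again."""
    Path(sysfs_root, "bus/pci/rescan").write_text("1")
```

A typical sequence would be `remove_device("0000:01:00.0")`, then reconfigure the FPGA, then `rescan_bus()`. The `sysfs_root` parameter exists only so the functions can be exercised against a scratch directory.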

When the hot plug removal event happens, the OS is meant to unload the
drivers.

The drivers get reloaded after the hot plug insertion event. Possibly
not the same drivers as before, if the FPGA contains something else.

Your higher level application needs to be aware that the driver can come
and go with the hot plug events. You'll need some sort of mechanism to
inform the application (e.g. a signal).
Presumably the application is the actual cause of the FPGA
reconfiguration, in which case it knows when the FPGA is there or not and
doesn't need to be told.

We have implemented all the optocoupled sideband signals for hot plug,
and training packets resume after we reconnect. We're still working on
it.

I found that just the presence detect was needed for reliable hot plug.
All the others are optional.

Regards,
Allan

#### Allan Herriman

Jan 1, 1970
0
I thought it was faster than that. If I remember, I'll measure some in
the lab tomorrow.

I looked at a trace on a board at work. I was surprised - the writes
were fast(ish) - about 100 ns was the smallest gap I saw between writes.

This seems consistent with Larkin's measurements.

I'm still surprised though - 2 us is 20000 bit times on a 4 lane gen 1.
Ok, it's only 16000 bit times before the 8B10B coding.
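
The arithmetic behind those two figures, spelled out as a quick check. It assumes the gen 1 line rate of 2.5 Gbps per lane and the 2 us read latency quoted earlier in the thread; the function and constant names are just for illustration.

```python
LANE_RATE_GBPS = 2.5     # gen 1 line rate per lane, in Gbit/s
LANES = 4                # x4 link
READ_LATENCY_NS = 2000   # the 2 us read latency under discussion


def bit_times(latency_ns, lanes=LANES, lane_rate_gbps=LANE_RATE_GBPS):
    """Return (raw_bits, bits_before_8b10b) that fit into `latency_ns`."""
    raw = latency_ns * lanes * lane_rate_gbps  # Gbit/s times ns gives bits
    before_coding = raw * 8 / 10               # 8B10B sends 10 line bits per 8 data bits
    return raw, before_coding


raw, before_coding = bit_times(READ_LATENCY_NS)
print(raw, before_coding)  # 20000.0 16000.0
```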

Maybe the switch is configured for store-and-forward rather than cut
through, or something equally silly.

Regards,
Allan

#### rickman

Jan 1, 1970
0
No, but it's mostly dead, as PCI will soon be. Intel busses only last a few
years each.

A *few* years? PCI has been around for 20 years!

D

Jan 1, 1970
0
Do current Intel chip sets support PCI-X? Or even PCI?

I think PCI was *just* a hard intermediary bus. All the other busses
were tertiary to it, and went through it to get to the CPU.

I think PCIe is a hard intermediary bus too but it practically has its
own API, and I would call that pretty advanced. PCI-X is likely
fully superseded by e, but elements of the original PCI paradigm

#### rickman

Jan 1, 1970
0
But mobos seldom have PCI slots any more. It's all PCIe now. And
Thunderbolt will displace PCIe.

Motherboard slots are going away. Hell, motherboards are going away!

Sure, for that matter PCs are going away for the mainstream. In 10
years it will literally be like working on the Enterprise... the space
ship Enterprise. Everyone will be using tablets and pads, there just
won't be a need for the traditional PC except for specialties... like
PCB layout, lol

There won't be any busses really. It will all be wireless. Maybe it
will all be powered by a Tesla type power source too. lol

That doesn't change the fact that PCI was mainstream for well over a

BTW, are you capable of learning? Or have you reached your learning
capacity?

#### Robert Baer

Jan 1, 1970
0
Nico said:
Most entry level scopes consist of an FPGA running a soft processor.

You mean PCI express?
What the hell is wrong with PARALLEL?
You get the _whole_ byte/word/whatever each possible I/O cycle and do
not have to wait 20+ cycles for preamble bits, 16 data bits, stop bits
(maybe more for stupid "framing" because designer was too lazy to
enforce assumptions that would speed things up).

#### Robert Baer

Jan 1, 1970
0
why is it any more or less evil than big endian?

-Lasse
Just give me his feather headband..
