Implement a multiprocessor NIOS II system on Altera FPGA

This example shows the use of multiple processors to perform different tasks in an Altera FPGA.

=> System

This is a non-hierarchical multiprocessor system running on an Altera Cyclone IV EP4CE15 FPGA.

The system contains three processors – Main Processor, Worker A Processor and Worker B Processor.

Two buffers – input buffer and output buffer – are shared between these processors.

The main processor keeps adding task packages to the input buffer as long as the buffer is not full.

The two worker processors take task packages out of the input buffer, process the task data, then put the processed task packages into the output buffer.

The main processor then takes the processed task packages out of the output buffer and prints a result message.

The input buffer and the output buffer can each hold at most eight task packages.

Access to the shared buffers is protected by hardware mutexes: one mutex for the input buffer and another for the output buffer.
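
The code later in this post locks each mutex with ALT_CPU_CPU_ID_VALUE+1 rather than with the CPU ID itself. A simplified host-side model shows why a non-zero value matters. It assumes, as I understand the mutex core, that the register holds an owner field and a value field, that a value of 0 means unlocked, and that a write is accepted only when the mutex is unlocked or the writer already owns it. The model_* names are hypothetical and this is not the HAL driver.

```c
#include <stdint.h>

/* Simplified host-side model of a hardware mutex register.
 * Assumption: upper 16 bits hold the owner CPU ID, lower 16 bits hold a
 * value, and a value of 0 means "unlocked". */
typedef struct {
    uint32_t reg;
} model_mutex;

/* Model of the write rule: the write is accepted only when the mutex is
 * unlocked or the writer already owns it. */
static void model_mutex_write(model_mutex *m, uint16_t cpu_id, uint16_t value)
{
    uint16_t owner = (uint16_t)(m->reg >> 16);
    uint16_t cur   = (uint16_t)(m->reg & 0xFFFFu);
    if (cur == 0 || owner == cpu_id) {
        m->reg = ((uint32_t)cpu_id << 16) | value;
    }
}

/* Write, then read back: returns 0 when the register holds what this CPU
 * wrote, -1 otherwise. */
static int model_mutex_trylock(model_mutex *m, uint16_t cpu_id, uint16_t value)
{
    uint32_t wanted = ((uint32_t)cpu_id << 16) | value;
    model_mutex_write(m, cpu_id, value);
    return (m->reg == wanted) ? 0 : -1;
}

/* Only the owner's write of value 0 gets through. */
static void model_mutex_unlock(model_mutex *m, uint16_t cpu_id)
{
    model_mutex_write(m, cpu_id, 0);
}
```

In this model, a CPU with ID 0 that locks with value 0 leaves the register at 0, so the mutex still looks unlocked to everyone else. Adding 1 to the CPU ID keeps the value non-zero for every processor.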

relation of processors and buffers

=> Hardware (FPGA configuration)

#) Tool: Quartus II 11.0

#) FPGA: Cyclone IV: EP4CE15F17C8N

#) Major components in Qsys design:

  • Clock: 50MHz, no PLL
  • Main Processor: 6KB on-chip memory as program memory
  • Worker A Processor: 5KB on-chip memory as program memory
  • Worker B Processor: 5KB on-chip memory as program memory
  • Two Shared buffers: 256 Byte on-chip memory
  • Two mutexes: initial value is 0

Design in Qsys

#) Block Diagram/Schematic File (.bdf file)

No peripherals are used here; only clock and power are connected to the FPGA chip.


=> Software

#) Tool: Nios II 11.0

#) Application Projects

Create one software project in Nios II for each processor, three projects in total.

Change the BSP settings to reduce the application memory footprint, for example:

  • Check “Reduced device drivers”
  • Uncheck “Support C++”
  • Check “Small C library”

#) Application Coding

Task package data

typedef struct {
    unsigned char ucCPUID; //CPU ID
    unsigned char ucOP; //Operation: not used
    unsigned char ucState; //State: full or empty
    unsigned char ucResult; //Result
    unsigned int uiNumber; //Task data for processing
}APP_DATA, *PAPP_DATA; //size: 8 bytes

Main processor adds task package to input buffer

if(0==altera_avalon_mutex_trylock(mutex_inBuffer, ALT_CPU_CPU_ID_VALUE+1))
{//get input buffer mutex
    uiInBufferPos =tableClear[*INPUT_BUFFER_BITMAP];
    if(uiInBufferPos<8) //assumed check: the table is taken to return 8 or more when no bit is clear
    {//has space in input buffer
        pAppData =INPUT_DATA_BASE + uiInBufferPos;
        pAppData->uiNumber=uiInputNum; //put data in
        *INPUT_BUFFER_BITMAP=(*INPUT_BUFFER_BITMAP)|(1<<uiInBufferPos); //set the bit
        alt_printf("Number - 0x%x is added to inBuffer slot 0x%x.\n",
            uiInputNum, uiInBufferPos);
    }
    else
    {//input buffer full
        alt_printf("Input Buffer is full.\n");
    }
    altera_avalon_mutex_unlock(mutex_inBuffer); //release input buffer mutex
}
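
The tableClear and tableSet lookup tables are not shown in this post. A minimal sketch of how such tables could be built follows, assuming each one is indexed by the 8-bit bitmap and returns the lowest clear (or set) bit position. The sentinel value 8 for "no such bit" and the init_tables helper are my assumptions, not taken from the project.

```c
#include <stdint.h>

#define SLOT_COUNT 8u          /* eight task packages per buffer */
#define NO_SLOT    SLOT_COUNT  /* assumed sentinel: no matching bit */

static uint8_t tableClear[256]; /* bitmap -> lowest clear bit (free slot) */
static uint8_t tableSet[256];   /* bitmap -> lowest set bit (used slot)  */

/* Fill both tables once at start-up. */
static void init_tables(void)
{
    for (unsigned bitmap = 0; bitmap < 256u; bitmap++) {
        tableClear[bitmap] = NO_SLOT;
        tableSet[bitmap]   = NO_SLOT;
        /* Scan from the top bit down so the lowest matching bit wins. */
        for (unsigned bit = SLOT_COUNT; bit-- > 0u; ) {
            if (bitmap & (1u << bit)) {
                tableSet[bitmap] = (uint8_t)bit;
            } else {
                tableClear[bitmap] = (uint8_t)bit;
            }
        }
    }
}
```

With only a few KB of program memory per processor, the two tables (512 bytes together) could also be written out as constant arrays instead of being filled at run time.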




Worker processors remove task package from input buffer

altera_avalon_mutex_lock(mutex_inBuffer, ALT_CPU_CPU_ID_VALUE+1);
uiInBufferPos =tableSet[*INPUT_BUFFER_BITMAP];
if(uiInBufferPos<8) //assumed check: the table is taken to return 8 or more when no bit is set
{//input buffer has task
    pAppData =INPUT_DATA_BASE + uiInBufferPos;
    appOutData.uiNumber=pAppData->uiNumber;//get data
    *INPUT_BUFFER_BITMAP=(*INPUT_BUFFER_BITMAP)^(1<<uiInBufferPos); //clear the bit
    alt_printf("Number - 0x%x is removed from inBuffer slot 0x%x.\n",
        appOutData.uiNumber, uiInBufferPos);
}
else
{//empty input buffer
    alt_printf("Input Buffer is empty.\n");
}
altera_avalon_mutex_unlock(mutex_inBuffer); //release input buffer mutex



Worker processors add task package to output buffer

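altera_avalon_mutex_lock(mutex_outBuffer, ALT_CPU_CPU_ID_VALUE+1);
uiOutBufferPos =tableClear[*OUTPUT_BUFFER_BITMAP];
if(uiOutBufferPos<8) //assumed check: the table is taken to return 8 or more when no bit is clear
{//output buffer has space
    pAppData =OUTPUT_DATA_BASE + uiOutBufferPos;
    *pAppData=appOutData; //assumed: copy the processed package into the slot
    *OUTPUT_BUFFER_BITMAP=(*OUTPUT_BUFFER_BITMAP)|(1<<uiOutBufferPos); //set the bit
    alt_printf("Number - 0x%x is added to outBuffer slot 0x%x .\n",
        appOutData.uiNumber, uiOutBufferPos);
}
else
{//output buffer is full
    alt_printf("Output buffer is full.\n");
}
altera_avalon_mutex_unlock(mutex_outBuffer); //release output buffer mutex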



Main processor removes task package from output buffer

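if(0==altera_avalon_mutex_trylock(mutex_outBuffer, ALT_CPU_CPU_ID_VALUE+1))
{//get output buffer mutex
    uiOutBufferPos =tableSet[*OUTPUT_BUFFER_BITMAP];
    if(uiOutBufferPos<8) //assumed check: the table is taken to return 8 or more when no bit is set
    {//has task package in output buffer
        pAppData =OUTPUT_DATA_BASE + uiOutBufferPos;
        *OUTPUT_BUFFER_BITMAP= (*OUTPUT_BUFFER_BITMAP)^(1<<uiOutBufferPos); //clear the bit
        alt_printf("Number - 0x%x is removed from outBuffer slot 0x%x.\n",
            pAppData->uiNumber, uiOutBufferPos); //assumed arguments
    }
    else
    {//empty output buffer
        alt_printf("Output Buffer is empty.\n");
    }
    altera_avalon_mutex_unlock(mutex_outBuffer); //release output buffer mutex
}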
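
The whole flow can be tried on a desktop before touching the board. The sketch below is a single-threaded host-side model, so the mutex calls are left out. SIM_BUFFER, find_bit, sim_put, sim_get and sim_worker_step are hypothetical names, keeping the bitmap next to the slots is an assumption, and the odd/even test only stands in for the real task processing.

```c
#include <stdint.h>

#define SLOT_COUNT 8u

typedef struct {
    unsigned char ucCPUID;   /* CPU that processed the package */
    unsigned char ucOP;      /* not used */
    unsigned char ucState;   /* not used in this sketch */
    unsigned char ucResult;  /* result of processing */
    unsigned int  uiNumber;  /* task data */
} APP_DATA;

/* One shared buffer: eight slots plus an occupancy bitmap. */
typedef struct {
    APP_DATA slot[SLOT_COUNT];
    uint8_t  bitmap;         /* bit n set = slot n holds a package */
} SIM_BUFFER;

/* Lowest bit of bitmap equal to 'want' (0 or 1), or SLOT_COUNT if none. */
static unsigned find_bit(uint8_t bitmap, unsigned want)
{
    for (unsigned i = 0; i < SLOT_COUNT; i++) {
        if (((bitmap >> i) & 1u) == want) return i;
    }
    return SLOT_COUNT;
}

/* Returns 0 on success, -1 when the buffer is full. */
static int sim_put(SIM_BUFFER *b, const APP_DATA *d)
{
    unsigned pos = find_bit(b->bitmap, 0u);
    if (pos >= SLOT_COUNT) return -1;
    b->slot[pos] = *d;
    b->bitmap |= (uint8_t)(1u << pos);
    return 0;
}

/* Returns 0 on success, -1 when the buffer is empty. */
static int sim_get(SIM_BUFFER *b, APP_DATA *d)
{
    unsigned pos = find_bit(b->bitmap, 1u);
    if (pos >= SLOT_COUNT) return -1;
    *d = b->slot[pos];
    b->bitmap &= (uint8_t)~(1u << pos);
    return 0;
}

/* One worker step: take a task, process it, hand back the result.
 * The processing itself (odd/even test) is a placeholder. */
static int sim_worker_step(SIM_BUFFER *in, SIM_BUFFER *out, unsigned char cpu)
{
    APP_DATA d;
    if (find_bit(out->bitmap, 0u) >= SLOT_COUNT) return -1; /* no room for the result */
    if (sim_get(in, &d) != 0) return -1;                    /* nothing to do */
    d.ucCPUID  = cpu;
    d.ucResult = (unsigned char)(d.uiNumber & 1u);
    return sim_put(out, &d);
}
```

The main processor role is then a loop of sim_put on the input buffer and sim_get on the output buffer, with sim_worker_step called once per worker in between.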




=> Final System

Steps to run the system

1. Power up the FPGA board

2. Program the compiled hardware image (.sof file) to the FPGA using the Quartus II Programmer

3. Open three Nios II command shells and run one of the following commands in each shell window to open a terminal

“nios2-terminal --instance=0” //for terminal 0 – Main Processor

“nios2-terminal --instance=1” //for terminal 1 – Worker A Processor

“nios2-terminal --instance=2” //for terminal 2 – Worker B Processor

4. In Nios II, run each application as Nios II hardware.

Before running the applications, disable the Nios II Console view in the “Target Connection” tab of the “Run Configuration” dialog.

If nothing goes wrong in the above steps, the three terminals will show messages from the three processors and keep updating.






Prepare Qt IDE for i.MX6 embedded Linux development

Desktop: Ubuntu 11.04

Qt Package: 4.7.0

Step 1: Install Qt Creator on Linux Desktop

Go to Ubuntu Software Center, find “Qt Creator”, and install it with default Add-ons

Step 2: Compile Qt Package

Please refer to

Step 3: Install compiled Qt Package

The compiled “qmake” is in “~/iMX6Linux/ltib/rpm/BUILD/qt-everywhere-opensource-src-4.7.0/bin”

The makefile used for installation is “~/iMX6Linux/ltib/rpm/BUILD/qt-everywhere-opensource-src-4.7.0/Makefile”

Go to directory “~/iMX6Linux/ltib/rpm/BUILD/qt-everywhere-opensource-src-4.7.0”, then run the following commands.

$sudo make install
$sudo make install_qmake
$sudo make install_mkspecs

The qmake will be installed to “~/iMX6Linux/ltib/rootfs/usr/local/Trolltech/bin”

Step 4: Configure Qt IDE

Add i.MX6 qmake to Qt IDE.

qt creator options dialog

Add a new build option for i.MX6 board to the project.

Step 5: Test

Develop a simple application in Qt Creator, compile it for the i.MX6 board, then run it on the board, for example,

$./test -qws

Qt test application runs on i.MX6 board

Everything looks good now.




Port Qt library and application to Freescale i.MX6 Sabre board

Photo first.


Qt runs on Freescale i.MX6 Sabre board

Qt application example is running on Freescale i.MX6 Sabre board in above photo.

For i.MX6 board information, please refer to Freescale website – “SABRE Platform for Smart Devices Reference Design Based on the i.MX 6 Series”

The desktop used for this port runs Ubuntu 11.04.

Step One: Build Qt package for board

1.  Prepare Qt package

a. Download Qt package from

File name: qt-everywhere-opensource-src-4.7.0.tar.gz
File size: 198.7MB

b. Copy the Qt package to /opt/freescale/pkgs

c. Edit the Qt spec file located at ~/iMX6Linux/ltib/dist/lfs-5.1/qt/qt-embedded.spec, and make sure the content of that file matches Qt package 4.7.0

2. Reconfigure LTIB

$cd ~/iMX6Linux/ltib
$./ltib -m config

The configuration dialog will show up; go to “Package selection” => “Package List” => “Qt” => “Qt (Qt Embedded)”.

Choose “Qt Embedded”.
According to the Help text, the “PKG_TSLIB”, “PKG_ZLIB”, “PKG_FONTCONFIG”, and “PKG_GLIB2” packages are selected automatically.

3. Compile Qt package

Save the configuration and exit, then run command
This will start Qt compiling.

It is likely you will get a few compiling errors, if so, please refer to QTSetup_Eco2012.pdf at

Fix all errors you get, then recompile it.

After compilation is done, the output Qt files are located at ~/iMX6Linux/ltib/rootfs/usr/local/Trolltech/

Step Two: Run Qt application on board

1. Update file system

Flash the U-Boot, the kernel image, and the root file system containing the newly compiled Qt to the SD card, then boot the board from that SD card.

2. Run Qt example on the board

$export QT_QWS_FONTDIR=/usr/lib/fonts
$cd /usr/local/Trolltech/examples/mainwindows/application
$./application -qws

The application now runs and its window appears on the display, but it can’t take input from the touch screen yet.

3. Make the touch screen work with the Qt application

a. Calibrate touch screen

$export TSLIB_TSDEVICE=/dev/input/ts0

b. Add the following lines to /etc/profile, save the file, then reboot the board

export TSLIB_TSDEVICE=/dev/input/ts0
export TSLIB_CONFFILE=/usr/etc/ts.conf
export TSLIB_PLUGINDIR=/usr/lib/ts
export TSLIB_CALIBFILE=/etc/pointercal
export QWS_MOUSE_PROTO=Tslib:/dev/input/ts0
export QT_QWS_FONTDIR=/usr/lib/fonts
export LD_LIBRARY_PATH=/usr/lib

c. Run application again

$./application -qws

This time, touch screen input goes to the application, and the user can open files, close files, etc. with a finger.

If you get “Initializing QFontEngineQPF failed for /usr/lib/fonts/DejaVuSans.ttf ” error on console, try this one.

$./application -qws -fn DejaVuSans


The Qt port is now complete. The next step for software development is setting up the Qt IDE on the Ubuntu desktop – Prepare Qt IDE for i.MX6 embedded Linux development




Add timeout control to TCP socket connecting in Qt

The Qt QTcpSocket class has no interface for the user to control the timeout of a socket connection request.

By default, the connection request times out after about 50 seconds on my Windows desktop.

That is too long for the user, as connecting takes less than 1 second on a local network when the server is online and healthy.

A single-shot timer can be used to resolve this issue.

This can be done in the following three steps.

1. Create a connection timer and set it as a single-shot timer.

m_timerConnection = new QTimer(this);
m_timerConnection->setSingleShot(true);
connect(m_timerConnection, SIGNAL(timeout()),
        this, SLOT(TimerConnection()));

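2. Call the “connectToHost” function, and start the 5-second connection timer.

void cltTCPConnection::ConnectToMachine()
{
    m_socket->connectToHost(m_strHost, m_nPort); //m_socket, m_strHost and m_nPort are assumed names
    m_timerConnection->start(5000); //start the connection timer
}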

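3. In the timeout event, check the socket state. If the socket is still in the connecting state, abort the connection attempt, and notify the user or write a log.

void cltTCPConnection::TimerConnection()
{
    if(m_socket->state()==QAbstractSocket::ConnectingState) //m_socket is an assumed member name
    {//still no connection
        m_socket->abort(); //then notify user or write log
    }
}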



From prototype to product: How is the performance of one embedded application improved?

Let’s give the system a name first: it will be called the PRN system in this post.

Background Information

  • What is this PRN system about?

The PRN system takes a customized user message as input, then prints it out.

  • How is the PRN system performance measured?

The PRN system performance is measured by one indicator – prints per second of a standard message.

  • What does the PRN system need to do to finish one print?

The PRN system has two subsystems: a microprocessor subsystem for user interaction, message handling, and print control, and a microcontroller subsystem for sensor and print head handling.

The two subsystems are connected through I2C, SPI, and GPIO.

To finish one print cycle, the microprocessor subsystem needs to do the steps below:
1. Get READY signal from microcontroller subsystem.
2. Update message content: there are two dynamic objects in benchmark message – Counter and Date/Time, both should be updated before every printing.
3. Generate message image, and save it to memory.
4. Process message image: rotating, mirroring, converting, quality adjusting, etc.
5. Send image data to microcontroller subsystem through SPI channel.
6. Give SENT signal to microcontroller subsystem.
7. Calculate ink usage of current message.

The microcontroller subsystem receives the image data and sends it to the print head for printing.

Starting point

All core functions are done and the system works; the application still has debug features enabled.

The performance needs to be improved to enlarge the customer base.

What was done to improve performance

Change message generation algorithm

  • Old way:
    The whole message is generated every time before printing.
  • New way:
    Only the images of the dynamic objects in the message are generated before printing, then merged transparently onto a static background image.
  • Effect on performance:
    The new way adds one extra step to image processing, merging, which makes it slower for certain messages.
    In general, the new way makes performance independent of message size. Normally a user only uses a few dynamic objects in one message, no matter how large the message is.
    So the new way greatly improves the performance of large messages.

After this change, the standard message takes about 290ms/print, roughly equal to 3 prints/second.
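
The merge step can be sketched as below. This is only an illustration under assumptions of mine: a 1-bit-per-pixel image format with 1 as a black dot, positions and widths in whole bytes, and the names mono_image and merge_dynamic. The real PRN image format is not described in this post.

```c
#include <stdint.h>
#include <string.h>

/* Assumed image format: 1 bit per pixel, 1 = black dot, rows packed into
 * bytes, all x positions and widths given in whole bytes. */
typedef struct {
    uint8_t *data;
    size_t   stride;  /* bytes per row */
    size_t   rows;
} mono_image;

/* Rebuild one region of 'out': restore the static background there, then
 * OR the freshly rendered dynamic object on top, so white (0) pixels of
 * the object stay transparent. Returns 0 on success, -1 if the region
 * does not fit. */
static int merge_dynamic(mono_image *out, const mono_image *background,
                         const mono_image *object, size_t x_byte, size_t y)
{
    if (out->stride != background->stride || out->rows != background->rows)
        return -1;
    if (x_byte + object->stride > out->stride || y + object->rows > out->rows)
        return -1;

    for (size_t r = 0; r < object->rows; r++) {
        size_t off = (y + r) * out->stride + x_byte;
        memcpy(out->data + off, background->data + off, object->stride);
        for (size_t c = 0; c < object->stride; c++) {
            out->data[off + c] |= object->data[r * object->stride + c];
        }
    }
    return 0;
}
```

The work done per print depends only on the size of the dynamic objects, which is why the print time stops growing with the size of the whole message.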

Adjust ink usage calculation feature

  • Old way:
    Count the black dots in the message and add them to the total every time after sending image data.
  • New way:
    Do the ink usage calculation only once, after printing is stopped, based on the black dot count of the first message; during printing, only count prints.
  • Effect on performance:
    The ink function now takes very little time.
    Side effect: the ink usage calculation is no longer real-time or accurate. This is acceptable, as ink information is low priority and is not accurate anyway because of other factors.
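
A minimal sketch of the new scheme, assuming a 1-bit-per-pixel image buffer where a set bit is a black dot. The names ink_stats, ink_on_print and ink_estimate_dots are hypothetical, and converting dots to an ink volume is left out.

```c
#include <stdint.h>
#include <stddef.h>

/* Count set bits (black dots) in a 1-bit-per-pixel image buffer. */
static uint32_t count_black_dots(const uint8_t *data, size_t len)
{
    uint32_t dots = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = data[i];
        while (b) {          /* clear the lowest set bit each round */
            b &= (uint8_t)(b - 1u);
            dots++;
        }
    }
    return dots;
}

typedef struct {
    uint32_t dots_first_message; /* measured once, on the first print */
    uint32_t prints;             /* only this is updated on later prints */
} ink_stats;

/* Per-print work in the new scheme: a counter increment, plus one dot
 * count for the very first message. */
static void ink_on_print(ink_stats *s, const uint8_t *image, size_t len)
{
    if (s->prints == 0) {
        s->dots_first_message = count_black_dots(image, len);
    }
    s->prints++;
}

/* Called once after printing stops; an estimate, not an exact total. */
static uint64_t ink_estimate_dots(const ink_stats *s)
{
    return (uint64_t)s->dots_first_message * s->prints;
}
```

The estimate ignores how the counter and date/time change the dot count from print to print, which is the accuracy given up in exchange for speed.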

Apply a few coding techniques

1. Replace some local variables with global variables or static local variables

2. Use “inline” keyword on a few small functions: signal from and to microcontroller

3. Remove some intermediate variables in functions

4. Use count down instead of count up in loops

5. Put frequent case labels first in “switch” block

6. Reduce the number of function parameters to no more than four
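
As an illustration of item 4, here are the two loop forms side by side. Both functions are hypothetical examples, not code from the PRN application, and an optimizing compiler may already turn the first form into the second.

```c
#include <stddef.h>
#include <stdint.h>

/* Count-up form: the loop compares i against len on every pass. */
static uint32_t sum_count_up(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Count-down form: the loop only tests the counter against zero. */
static uint32_t sum_count_down(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    while (len-- > 0) {
        sum += buf[len];
    }
    return sum;
}
```

The count-down form only fits loops where the visiting order does not matter, as with this sum.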


1. Switch application from debug mode to release mode, and choose strong compiler optimization.

2. Switch to higher SPI speed for communication with microcontroller subsystem.

Finishing point

After making the above changes, the new performance number is about 140ms/print, roughly equal to 7 prints/second.

When the PRN system runs at this maximum speed, CPU utilization is very high and the application GUI is very slow for the user.

So the final speed is limited in code to a maximum of 5 prints/second. This leaves enough CPU resource for user interaction, and the limited maximum speed is still more than enough for most users.
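
One simple way to enforce such a limit is a minimum interval between print starts: 1000ms / 5 = 200ms. The sketch below assumes a free-running millisecond tick is available. print_throttle and throttle_may_print are hypothetical names, and the post does not say how the limit was actually implemented.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_PRINTS_PER_SECOND 5u
#define MIN_PRINT_INTERVAL_MS (1000u / MAX_PRINTS_PER_SECOND) /* 200 ms */

typedef struct {
    uint32_t last_start_ms;
    bool     has_printed;
} print_throttle;

/* Returns true (and records the time) when a new print cycle may start.
 * now_ms is a free-running millisecond tick; unsigned subtraction keeps
 * the comparison correct across counter wrap-around. */
static bool throttle_may_print(print_throttle *t, uint32_t now_ms)
{
    if (t->has_printed &&
        (uint32_t)(now_ms - t->last_start_ms) < MIN_PRINT_INTERVAL_MS) {
        return false;
    }
    t->last_start_ms = now_ms;
    t->has_printed = true;
    return true;
}
```

With a 140ms print cycle and a 200ms minimum interval, about 60ms of each cycle is left for the GUI and other work.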

Further approaches for improving performance

1. Remove ink usage calculation from the print cycle and replace it with ink usage monitoring on the microcontroller side.

2. Only send the images of dynamic objects to the microcontroller side: this needs a more powerful microcontroller subsystem.

3. A more balanced system: add external memory to the microcontroller and share the load between the two subsystems, so that, say, the microcontroller has a data buffer and does some simple image processing.

4. A more powerful system, say a higher microprocessor CPU frequency.

By the way, the same application runs many times faster on an Ubuntu virtual machine (Intel i7 @3.4GHz) than on the ARM board (ARM1176JZF-S @533MHz), about 7 times faster; one comparison number is 20ms:150ms.

So it is likely that performance can be improved further, to 20 prints/second, if needed; this higher speed would be good for a PRN system that manages more microcontrollers or print heads.


