High-Performance NoC Interface with Interrupt Batching for
           Micronmesh MPSoC Prototype Platform on FPGA

                                                       Heikki Kariniemi and Jari Nurmi
                                                      Department of Computer Systems
                                                     Tampere University of Technology
                                                              Tampere, Finland
                                                  Email: {heikki.kariniemi, jari.nurmi}@tut.fi

Abstract—This paper presents a new NoC Interface (NI) targeted for improving the performance of the Micronmesh Multiprocessor System-on-Chip (MPSoC). The previous version of the NI, called Micronswitch Interface (MSI), can zero-copy messages as it sends and receives them. It also offloads some functionalities of the communication protocol from software (SW) to hardware (HW), but interrupt processing produces extra SW overhead and reduces the performance. For this reason, an improved version of the MSI called MSI-with-Queues (MSIQ) was designed with a new queue mechanism in order to reduce the frequency of interrupts and the SW overhead. Owing to the new queue mechanism of the MSIQ, it is possible to batch and service multiple interrupt service requests with every execution of the Interrupt Service Routine (ISR). Additionally, the new MSIQ HW is able to send and receive messages while the processor is running the ISR. The performance of the MSIQ is also analyzed in this paper. The results show that the queue mechanism improves the performance with moderate hardware costs.

                          I.    INTRODUCTION

     In computer systems where computers are connected by high-speed networks, the operation of the network interfaces may become a main obstacle for the communication throughput and the performance. This is because the communication between the CPUs and the network interfaces produces extra software overhead. Several methods, such as zero-copying, protocol offloading, jumbo frames, message fragmentation, and interrupt coalescing, have been presented in the literature [1, 2, 3, 4, 5, 6, 7] for eliminating this problem. Due to certain architectural similarities, these same methods can be used for solving the same problem in MPSoCs where distributed-memory and message-passing communication architectures are used.

     In the Micronmesh MPSoC platform [8], the tightly coupled operation of the Micron Message-Passing (MMP) protocol [9] and the MSIQ enables direct message transfers between the local variables of the user threads and the MSIQ, which is a technique called zero-copying in the literature [1, 2, 3, 5]. Zero-copying reduces the communication latency and improves the performance, because it eliminates the copying of messages from user memory to the MSIQ through intermediate buffers in the kernel memory. The multiplexing and demultiplexing functions of the MMP protocol are also offloaded to the MSIQ HW in order to reduce the software overhead. Protocol offloading is used for speeding up the protocol functions by HW and for reducing the software overhead [1, 2, 3, 4, 5].

     Interrupt-driven systems provide low latency and low SW overhead if the interrupt rate is low, but the performance degrades as the interrupt frequency grows. Interrupts produce additional SW overhead by causing a context switch from user mode to kernel mode before the execution of the ISR, and back to user mode from kernel mode after the execution of the ISR is finished [1, 2, 3, 4, 5, 6, 7, 10, 11]. The last three methods mentioned above are used for reducing the software overhead produced by the interrupt processing and the processor utilization. The usage of jumbo frames, i.e. large messages, makes it possible to reduce the message rate and the interrupt frequency [2, 3, 5]. Fragmentation is related to the jumbo frames, which are usually fragmented to smaller frames before sending [1, 3, 5]. The MSIQ HW also fragments the messages to small fixed-sized packets as it sends them to the Micronmesh NoC, and assembles the received messages from the received packets.

     Interrupt coalescing [2, 3, 5] is a technique used for batching interrupt service requests so that every execution of the ISR can serve several requests, which reduces the interrupt frequency and the software overhead. It also has variants called Interrupt Multiplexing [1] and the Enabling Disabling (ED) technique [4]. In a typical implementation the interrupts are delayed until a certain number of interrupts has been batched or a timeout expires. The implementation used in the new MSIQ works slightly differently. When receiving messages, the MSIQ generates an interrupt immediately after it has received a new message. If more messages arrive, or have arrived in bursts, during the execution of the ISR, they are also served. This method provides low latency and good tolerance against bursts of short messages, in addition to the reduced interrupt frequency. When sending messages, the MSIQ sends several messages successively in batches. It generates the interrupt after finishing the sending of the first message of a batch, which makes it possible to start running the ISR while the sending still continues. As a consequence, the ISR can also run concurrently with the MSIQ HW, which improves the performance further.

     In the MSIQ the interrupt coalescing is implemented with send-request and receive-request queues. The results of the performance analysis and the logic synthesis presented in this paper show that the improved performance is achieved with small additional HW costs compared to the old MSI [12]. The MSIQ could also be used with polling, but polling is usually used together with interrupts and is more difficult to implement [6, 7]. Furthermore, the length of the polling period must be carefully adapted to the message rate in order to achieve good performance: if it is too long, the communication latency grows, and if it is too short, the software overhead grows.

     This paper is organized as follows. Section II presents the architecture and the operation of the new MSIQ. Section III presents the performance analysis and the HW costs of the new MSIQ, and finally, Section IV concludes this paper.

               II.    MICRONSWITCH INTERFACE WITH QUEUES

     The Micronmesh MPSoC platform [8] consists of Micronmesh nodes that contain a local NIOS II processor [13], local on-chip memories, a timer, a local Avalon system bus [14], the MSIQ, and the Micronswitch [8]. The NIOS II processors run distinct MicroC/OS-II real-time kernels [11] in every Micronmesh node. The MSIQs connect the Micronmesh nodes to the Micronmesh NoC through the local Micronswitches.

  This research is funded by the Academy of Finland under grant 122361.




 978-1-4244-8971-8/10/$26.00 © 2010 IEEE
A. The Architecture of the MSIQ

    The MSIQ consists of three main sub-blocks: the MSIQ Rx-master, the MSIQ Tx-master, and the MSIQ Slave. It is depicted at the bottom of Fig. 1. The MSIQ Rx-master on the left receives messages from the NoC, the MSIQ Tx-master on the right sends messages to the NoC, and the MSIQ Slave in the middle is used for controlling and configuring the operations of the MSIQ Masters through the MSIQ's register interface. The MSIQ Slave is also responsible for generating interrupt service requests according to the MSIQ Masters' status.

                Figure 1. The architecture of the MSIQ.

    The MSIQ's register interface is partly presented in Table I. It contains a status register MSIQ-status which is a combined status of the MSIQ Masters. The values of the Tx-control, Tx-base-address, Tx-routing-header, and Tx-protocol-control-header registers form send-requests that are stored to the HW send-request queue of the MSIQ HW (HW SEND-REQUEST QUEUE). The writing of these registers starts the sending of one message. Respectively, the values of the Rx-routing-header and Rx-protocol-control-header registers form receive-requests that are stored into the HW receive-request queue of the MSIQ HW (HW RECEIVE-REQUEST QUEUE). The reading of these registers ends the receiving of one message. The MSIQ Slave also contains four FIFOs for storing the send-requests and two FIFOs for storing the receive-requests, as Table I explains.

       TABLE I.        MSIQ'S REGISTER INTERFACE AND QUEUES

  MSIQ-status: The common status register of the MSIQ Masters.

  Rx-control: The control register used for controlling the MSIQ Rx-master's operation.

  Rx-base-address: The base address of the Rx-buffer table.

  Rx-routing-header: The Rx-routing header of the last packet of the received message. This register is part of the receive-request queue and is the output of the Rx-routing-header-FIFO.

  Rx-protocol-control-header: The Rx-protocol-control header of the last packet of the received message. This register is part of the receive-request queue and is the output of the Rx-protocol-control-header-FIFO.

  Tx-control: The Tx-control register used for controlling the MSIQ Tx-master's operation. This register is part of the send-request queue and is the input of the Tx-control-FIFO.

  Tx-base-address: The start address of the message stored into the Tx-buffer. This register is part of the send-request queue and is the input of the Tx-base-address-FIFO.

  Tx-routing-header: The Tx-routing header template of the packets of the message to be sent. This register is part of the send-request queue and is the input of the Tx-routing-header-FIFO.

  Tx-protocol-control-header: The Tx-protocol-control header template of the packets of the message to be sent. This register is part of the send-request queue and is the input of the Tx-protocol-control-header-FIFO.

    The MSIQ Tx-master starts sending messages as it receives send-requests through the HW send-request queue from the MSIQ Slave. The MSIQ Tx-master's Avalon interface (AVA-TX-IF) reads the messages directly from the Tx-buffers in the local memory (LOCAL MEMORY), fragments the messages, generates packets of the fragments, and writes the packets to the Tx-FIFO, from which the MSIQ Tx-master's Tx-interface (TX-IF) sends them to the Micronswitch. Packets consist of two headers and two payload words [9, 12]. The addresses of the messages are passed to the MSIQ Tx-master's Avalon interface through the Tx-base-address-FIFO. In Fig. 1 this address points to the beginning of the Tx-buffer A of thread B, which is illustrated by arrow A. The routing headers and the protocol control headers of the packets are stored into the Tx-routing-header-FIFO and the Tx-protocol-control-header-FIFO. The control register values are passed through the Tx-control-FIFO. After finishing the sending of a message, the MSIQ Tx-master changes its status in order to make the MSIQ Slave generate an interrupt service request, reads the next send-request from the HW send-request queue, and continues sending messages until the HW send-request queue becomes empty. It can continue sending while the processor is running the ISR. The maximum size of the message batches depends on the size of the HW send-request queue: the larger the HW send-request queue, the more messages can be sent without interrupts. If only one message could be sent at a time, the execution time of the interrupts would dominate the total sending time, especially if the messages were short [12]. Hence, owing to the HW send-request queues, it is possible to reduce the interrupt frequency and improve the performance.

    The MSIQ Rx-master's Rx-interface (RX-IF) receives packets from the Micronswitch and writes them to the Rx-FIFO. The MSIQ Rx-master's Avalon interface (AVA-RX-IF) reads the packets from the Rx-FIFO and writes the packet payloads to the Rx-buffers, which are in the local memory. It obtains the Rx-buffer addresses from the
local memory from the Rx-buffer table (RX-BUFFER TABLE), which is referred to by the Rx-base-address register, as arrow B in Fig. 1 illustrates, and computes the storage addresses of the packet payloads. When doing this, the MSIQ Rx-master's Avalon interface demultiplexes and assembles the messages of different Rx-channels from one input packet stream through the Rx-FIFO to multiple Rx-buffers. The Channel Identifiers (CID) of the protocol control headers are used for addressing the Rx-buffer table elements, as arrow C illustrates. The Rx-buffer table elements contain the Rx-buffer addresses, as arrow D illustrates. These addresses are used by the MSIQ Rx-master for addressing the Rx-buffers, as arrow E illustrates. After finishing the receiving of a message, the MSIQ Rx-master's Avalon interface writes the receive-request to the HW receive-request queue and changes its status in order to make the MSIQ Slave generate an interrupt service request. If the HW receive-request queue is not full, the receiving can continue while the processor is running the ISR. Since every execution of the ISR can service multiple receive-requests, the number of interrupts can be reduced. This happens especially if the messages are short and several messages arrive in bursts between consecutive executions of the ISR. Furthermore, the performance also improves because the receiving needs to be stopped less frequently.

B. The MSIQ device driver and the MMP protocol

    The main parts of the MSIQ device driver (MSIQ SW) are a state data structure, send (msiq_send) and receive (msiq_receive) functions, and the ISR (msiq_isr). The MSIQ SW is used by the MMP protocol's functions for controlling the operations of the MSIQ. The MMP protocol is a messaging layer protocol which forms an Application Programming Interface (API) for programming fault-tolerant message-passing applications [9]. This API contains, for example, functions for sending (mmpp_send) and receiving (mmpp_receive) messages. The MSIQ SW's state data structure also contains a SW send-request queue and a Tx-serviced queue. In the SW send-request queue the send-requests are pointers to the data structures of the MMP protocol's channels [9], which contain the register values of the send-requests to be stored into the HW send-request queue. The elements of the Tx-serviced queue are pointers to the Tx-channels' signaling semaphores.

C. Sending of messages

    The messages are sent in the following way.

1. A thread calls the mmpp_send function, which calls the msiq_send function of the MSIQ SW.

2. The msiq_send function first puts the address of the Tx-channel's data structure to the SW send-request queue. Then it reads the status of the MSIQ. If the MSIQ Tx-master is idle, it reads the send-request from the SW send-request queue, stores the address of the Tx-channel's signaling semaphore to the Tx-serviced queue, and writes the send-request to the HW send-request queue. This enables the MSIQ Tx-master to send, and the operation continues in step three. If the MSIQ Tx-master is not idle, msiq_send lets the ISR (msiq_isr) of the MSIQ device driver initialize the sending of the next message as the processor starts running it in step four after the previous send is finished, and returns. The accessing of the MSIQ SW's state data structure and the MSIQ's register interface is controlled by a semaphore so that they can be accessed by only one thread at a time or by the msiq_isr. Additionally, because the msiq_isr also has a higher priority than the threads, it can be guaranteed that the MSIQ SW's data structures and queues are maintained correctly.

3. The MSIQ Tx-master's Avalon interface reads the send-request from the HW send-request queue and starts reading the message from the Tx-buffer, slices it into packet payloads, generates both of the headers for every packet, and writes complete packets to the Tx-FIFO. The MSIQ Tx-master's Tx-interface reads the packets from the Tx-FIFO and sends them to the Micronswitch. After the sending of the message is finished, the MSIQ Tx-master's Avalon interface changes its status and the MSIQ Slave generates an interrupt service request accordingly, which starts the execution of the MSIQ ISR in step four. If the HW send-request queue is not yet empty, the MSIQ Tx-master's Avalon interface reads the next send-request from it and continues sending messages until the queue is empty, while the processor is running the msiq_isr (ISR) in step four.

4. The processor starts running the msiq_isr (ISR). The msiq_isr acknowledges the interrupt service request, reads the address of the signaling semaphore from the Tx-serviced queue, and posts the signaling semaphore to the thread which called the mmpp_send function. This wakes up the thread and the mmpp_send function returns. If the SW send-request queue is not empty, the msiq_isr reads the next send-request from it, stores the address of the Tx-channel's signaling semaphore to the Tx-serviced queue, and writes the next send-request to the HW send-request queue, which enables the sending and the interrupts again. These operations are repeated in a loop until all of the signaling semaphores of the serviced send-requests have been posted from the Tx-serviced queue and either the HW send-request queue is full or the SW send-request queue is empty.

    As steps three and four show, the HW send-request queue enables the interrupt batching. Additionally, the Tx-buffers are mapped to the local variables of the threads, and the MSIQ HW uses DMA (Direct Memory Access) transfers for zero-copying the messages directly from the Tx-buffers. The MSIQ also slices the messages into packets as it multiplexes and sends them in one packet stream to the Micronmesh NoC, which implements the message fragmentation.

D. Receiving of messages

    The messages are received in the following way.

1. A thread calls the mmpp_receive function, which prepares the Rx-channel for receiving by deasserting the lock bit and by updating the address field of the Rx-channel's Rx-buffer table element. Then it calls the msiq_receive function of the MSIQ SW, which enables the MSIQ Rx-master to receive messages.

2. The MSIQ Rx-master's Rx-interface receives packets from the Micronswitch and writes them to the Rx-FIFO. The Rx-master's Avalon interface reads the packets from the Rx-FIFO one by one, computes the addresses of the Rx-buffer table elements by adding the packets' CIDs multiplied by four to the Rx-base-address register's value, and reads the Rx-buffer table elements from the local memory. Then it multiplies the packets' sequence numbers, carried in the protocol control headers, by eight and the address field of the Rx-buffer table element by four. The sums of these two products are the storage addresses of the packet payloads. These multiplications are performed by simple shift-left operations. After computing the storage addresses, the MSIQ Rx-master writes the packet payloads to the Rx-buffers. If successive packets have the same CID, the Rx-master can reuse the Rx-buffer table element, and only the storage address must be computed again for each of the packets separately. Otherwise, the Rx-buffer table elements must be read from the memory. After the last packet of the message is received, the MSIQ Rx-master's Avalon interface asserts the lock bit, updates the address field of the Rx-buffer table element to point to the end of the message, writes the Rx-buffer table element to the memory, writes the receive-request to the HW receive-request queue, and changes its status in order to make the MSIQ Slave generate an interrupt service request. Then it continues receiving messages until the HW receive-request queue is full, while the msiq_isr (ISR) is executed in step three.

3. The processor starts running the msiq_isr (ISR) function. The msiq_isr acknowledges the MSIQ Rx-master's interrupt service request, reads the receive-request from the HW receive-request queue, obtains the address of the Rx-channel's data structure by the CID from the
MSIQ SW’s data structure, and posts the Rx-channel’s signaling               = Npck × 5 clock cycles. Owing to this simplification and because the
semaphore to the thread that called mmpp_receive function. These             interfaces operate at the same clock rate, it is not any longer necessary
operations are repeated in a loop until the HW receive-request queue is      to take into consideration the filling of the Tx-FIFO and the emptying
empty or a certain maximum number of receive-requests are serviced.          of the Rx-FIFO.
    Hence, the HW receive-request queue of the MSIQ can be used
for batching the interrupts. The MSIQ Rx-master’s Avalon interface           B. The performance of the MSIQ SW and HW
also offloads the MMP protocol’s functions partly by using the Rx-               In the performance analysis a couple of things must be taken into
buffer table for demultiplexing interleaved packets of different             consideration. Firstly, the length of the messages and the size of the
channels from a single input packet stream according to the CIDs.            queues Qsize affect the theoretic maximum throughput. Secondly, the
Furthermore, because the Rx-channels’ Rx-buffers are mapped to the           MSIQ masters can receive and send messages while the local
local variables of the threads [9], it can use DMA for zero-copying and      processors are running the ISR. Additionally, the ISR (msiq_isr)
assembling the messages to the Rx-buffers.                                   consists of different Tx-ISR and Rx-ISR branches for servicing
                                                                             interrupts caused by the MSIQ Tx-master and the MSIQ Rx-master as
                                                                             was described in sections II.C and II.D.
                   III.   PERFORMANCE ANALYSIS
    A theoretic approach is used for estimating the performances of               The execution time of the Tx-ISR is
the MSI and the MSIQs. This is because several factors like, for             Ttx-isr (n) = Ttx-start + n × Ttx-loop,                           (1)
example, the operation speed of memories, the size of cache
memories, the operation delay of interrupt logic etc. affect the             where Ttx-start is the time consumed in the beginning of the execution of
performance and measurements with only one configuration would not           the ISR before the Tx-loop iterations and where n = 1, …, Qsize is the
produce reliable estimates. However, the execution time of the ISR           number of serviced send-requests. Parameter Qsize is also the
was measured for calculations with a simple platform where the MSIQ          maximum batch size and Ttx-loop is the execution time of the Tx-ISR’s
Masters were connected to different ports of a dual-port on-chip             Tx-loop executed in step four of sending as described in subsection
SRAM which contained the buffers. Furthermore, the program code              II.C. The sending of other messages generates new interrupt service
and data were stored to a different single-port on-chip SRAM. The            requests, but they are masked during the execution of the ISR.
performance analysis is targeted for comparing the operations, the
                                                                                  The service time of the Tx-interrupts is
costs, and the performances of the new MSIQ and the MSI.
                                                                             Ttx-int (n) = Tres + Ttx-isr (n) + Trec,                          (2)
    The theoretic maximum throughputs with messages of different
sizes represent the peak communication performances achievable               where parameter Tres is the response time between the assertion of the
when as many messages as possible are sent or received continuously.         interrupt request and the start of the ISR’s execution, and Trec is the
In the first step of the analysis the performance of the MSIQ HW is          interrupt recovery time. If NIOS II/f (fast) core is used, parameter Tres
analyzed. The result of the first step is used for simplifying the second    = 105 clock cycles and parameter Trec = 62 clock cycles [10].
step of the performance analysis where the performance of both the
MSIQ HW and the MSIQ SW is analyzed together.                                     The execution time of the Rx-ISR is
                                                                             Trx-isr (n) = Trx-start + n × Trx-loop,                           (3)
A. The performance of the MSIQ HW
                                                                             where Trx-start is the time consumed in the beginning of the execution
     As messages are sent the MSIQ Tx-master’s Avalon interface              of the ISR before the Rx-loop iterations and where n = 1, …, Qsize is
reads packet payloads of two words from the Tx-buffers, generates            the number of Rx-ISR’s Rx-loop iterations which is limited by the size
packets, and stores the packets to the Tx-FIFO. After storing the last       of queues Qsize. Parameter Trx-loop is the time consumed by each of the
packet of the message to the Tx-FIFO, it changes its status in order to      Rx-loop iterations executed in step three of receiving as described in
make the MSIQ Slave to generate an interrupt. The latency of reading         subsection II.D. The receiving of new messages generates also
the payloads of Npck packets is Dread(Npck) = Npck×4 +2 clock cycles.        receive-requests, but the interrupts are masked during the execution of
This includes the time required for generating and storing Npck packets      the ISR.
to the Tx-FIFO. The latency of sending Npck packets from the Tx-FIFO
to the Micronswitch is Dsend(Npck) = Npck×5 clock cycles respectively.            The service time of the Rx-interrupts is
Since Dread(Npck) ≤ Dsend(Npck), when Npck ≥ 2, it can be concluded that
                                                                             Trx-int (n) = Tres + Trx-isr (n) + Trec,                          (4)
the MSIQ Tx-master’s Tx-interface limits the throughput.
                                                                             where parameters n, Tres, and Trec are equal to those of formula (2).
     The MSIQ Rx-master’s Avalon interface reads packets from the
Rx-FIFO, reads the Rx-buffer table elements and computes the storage              In the performance analysis the operation of the MSIQ HW and
addresses, and writes the packet payloads to the Rx-buffers. After the       SW can be divided into periods during which the MSIQ masters send
last packet of a message it changes its status in order to make the MSI      or receive a certain number of messages and the ISR is executed once.
Slave to generate an interrupt. The latency of writing the payloads of       The length of the periods is denoted by Tperiod (n), where n = 1, …,
Npck packets to the Rx-buffer is Dwrite(Npck) = 2 + Npck×2 + 2 clock         Qsize is the number of serviced send-requests or receive-requests, i.e.
cycles. The latency of receiving Npck packets through the Rx-interface       the batch size. The length of the period is determined by the execution
of the MSIQ Rx-master (RX-IF) is Dreceive(Npck) = Npck×5 clock               time of the interrupt services or the time required for sending or
cycles. Since Dwrite(Npck) ≤ Dreceive(Npck), when Npck ≥ 2, it can be        receiving n messages. The value of parameter n is floating and its
concluded that the MSIQ Rx-master’s Rx-interface limits the                  value depends also on the message size. The length of the period
throughput.                                                                  determines the theoretic maximum message rate
    As was shown the Tx-interface and the Rx-interface of the MSIQ           Rmsg (n) = n / Tperiod (n)                                        (5)
Masters limit the throughputs like in the original MSI [8]. Therefore,
in order to simplify the performance analysis of the MSIQ HW and             and the theoretic maximum bit rate
SW it can be assumed that the processing of every packet takes five          Rbit (n) = Msize × Rmsg (n) = Msize × n / Tperiod (n),            (6)
clock cycles also by both of the Avalon interfaces of the MSIQ
Masters and that Dread(Npck) = Dsend(Npck) = Dwrite(Npck) = Dreceive(Npck)
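The latency expressions above can be collected into a small model and the bottleneck claim checked numerically; the following sketch is only illustrative (the function names are ours, not part of the MSIQ design):

```python
# Latency models of the MSIQ HW (in clock cycles) for a message of n_pck packets.
def d_read(n_pck):     # read payloads, generate and store packets to the Tx-FIFO
    return n_pck * 4 + 2

def d_send(n_pck):     # send packets from the Tx-FIFO to the Micronswitch
    return n_pck * 5

def d_write(n_pck):    # write packet payloads to the Rx-buffers
    return 2 + n_pck * 2 + 2

def d_receive(n_pck):  # receive packets through the Rx-interface (RX-IF)
    return n_pck * 5

# For messages of at least two packets, the Tx- and Rx-interfaces
# (5 cycles per packet) are the bottlenecks on both paths.
for n in range(2, 100):
    assert d_read(n) <= d_send(n)
    assert d_write(n) <= d_receive(n)
```

Note that for a single-packet message Dread(1) = 6 > Dsend(1) = 5, which is why the condition is stated for Npck ≥ 2.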
where parameter Tres is the response time between the assertion of the interrupt request and the start of the ISR's execution, and Trec is the interrupt recovery time. If the NIOS II/f (fast) core is used, Tres = 105 clock cycles and Trec = 62 clock cycles [10].

    The execution time of the Rx-ISR is

Trx-isr (n) = Trx-start + n × Trx-loop,                                  (3)

where Trx-start is the time consumed in the beginning of the execution of the ISR before the Rx-loop iterations, and n = 1, …, Qsize is the number of the Rx-ISR's Rx-loop iterations, which is limited by the size of the queues Qsize. Parameter Trx-loop is the time consumed by each of the Rx-loop iterations executed in step three of the receiving, as described in subsection II.D. Receiving new messages also generates receive-requests, but the interrupts are masked during the execution of the ISR.

    The service time of the Rx-interrupts is

Trx-int (n) = Tres + Trx-isr (n) + Trec,                                 (4)

where parameters n, Tres, and Trec are equal to those of formula (2).

    In the performance analysis the operation of the MSIQ HW and SW can be divided into periods during which the MSIQ masters send or receive a certain number of messages and the ISR is executed once. The length of the periods is denoted by Tperiod (n), where n = 1, …, Qsize is the number of serviced send-requests or receive-requests, i.e. the batch size. The length of the period is determined by the execution time of the interrupt services or by the time required for sending or receiving n messages. The value of parameter n is not fixed, and it depends also on the message size. The length of the period determines the theoretic maximum message rate

Rmsg (n) = n / Tperiod (n)                                               (5)

and the theoretic maximum bit rate

Rbit (n) = Msize × Rmsg (n) = Msize × n / Tperiod (n),                   (6)

where n = 1, …, Qsize and parameter Msize is the message size in bits. The theoretic maximum bit rate Rbit (n) is the theoretic maximum throughput. Formulas for the theoretic maximum throughputs are derived for sending and receiving separately in the following two subsections.
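Formulas (3)-(6) translate directly into a small timing model. The sketch below uses the NIOS II/f values Tres = 105 and Trec = 62 cycles from [10]; the start and loop times are left as parameters, and the function names are ours:

```python
T_RES, T_REC = 105, 62   # NIOS II/f interrupt response and recovery cycles [10]

def t_isr(n, t_start, t_loop):
    # Formula (3): ISR start overhead plus n loop iterations, n = 1..Qsize
    return t_start + n * t_loop

def t_int(n, t_start, t_loop):
    # Formula (4): total interrupt service time
    return T_RES + t_isr(n, t_start, t_loop) + T_REC

def r_msg(n, t_period):
    # Formula (5): messages per clock cycle during one period
    return n / t_period

def r_bit(n, t_period, m_size):
    # Formula (6): bits per clock cycle during one period
    return m_size * r_msg(n, t_period)
```

For example, with Tstart = 20 and Tloop = 450 cycles (the values used later in subsection III.C), servicing a full batch of four requests takes t_int(4, 20, 450) = 105 + 20 + 4 × 450 + 62 = 1987 clock cycles.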
  1) The throughput with the send-request queue

    If Ttx-int (Qsize) = Qsize × Ttx-msg, where parameter Ttx-msg = Dsend(Npck) is the sending time of a message as defined in subsection III.A, the MSIQ Tx-master is able to send messages continuously without stopping the sending while the processor is running the Tx-ISR. The HW send-request queue can never be emptied by the MSIQ Tx-master, because the processor runs the Tx-ISR, which puts new send-requests from the SW send-request queue to the HW send-request queue. The MSIQ Tx-master generates an interrupt after every sending of a message, but these interrupt service requests are masked if the processor is running the ISR. The performance analysis of the MSIQ Tx-master consists of two separate cases, where either Ttx-int (Qsize) > Qsize × Ttx-msg or Ttx-int (Qsize) ≤ Qsize × Ttx-msg, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Tx-master.

    In the case of shorter messages the interrupt service time is longer than the sending time of Qsize messages and Ttx-int (Qsize) > Qsize × Ttx-msg. In this case the HW send-request queue is emptied and the MSIQ Tx-master must stop sending messages until the Tx-ISR puts the next send-requests into the HW send-request queue. Thus, with shorter messages the interrupt service time Ttx-int (n) determines the length of the period and Tperiod (n) = Ttx-int (n). The message rate is Rmsg (n) = n / Tperiod (n) = n / Ttx-int (n), where n = 1, …, Qsize, and the bit rate is Rbit (n) = Msize × Rmsg (n). The theoretic maximum throughput is achieved with value n = Qsize, when the ISR loads Qsize send-requests to the HW send-request queue, and it is

Rbit (Qsize) = Msize × Rmsg (Qsize) = Msize × Qsize / Ttx-int (Qsize).   (7)

    In the case of longer messages the interrupt service time can be shorter than the sending time of the messages and Ttx-int (Qsize) ≤ Qsize × Ttx-msg. Because the Tx-ISR can put a larger number of send-requests to the HW send-request queue than the MSIQ Tx-master can send during the interrupt service time Ttx-int (Qsize), the HW send-request queue is nonempty most of the time and the sending can continue without stops. Because the number of Tx-loop iterations of the Tx-ISR depends on the message size, which determines the sending time, parameter n can also be smaller than Qsize. Hence, the sending time of the messages determines the length of the period, Tperiod (n) = n × Ttx-msg, and the theoretic maximum message rate is Rmsg (n) = n / Tperiod (n) = n / (n × Ttx-msg) = 1 / Ttx-msg, where n = 1, …, Qsize. In this case the theoretic maximum throughput does not depend on the value of parameter n and it is

Rbit (n) = Msize × Rmsg (n) = Msize / Ttx-msg.                           (8)
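The two cases above combine into one piecewise throughput function. The sketch below is our reading of formulas (7) and (8); the 100 MHz clock and the 64-bit (two 32-bit word) packet payload are assumptions taken from subsections III.A and III.C, used only for converting cycles to bits per second:

```python
F_CLK = 100e6          # clock frequency assumed in subsection III.C
BITS_PER_PACKET = 64   # two 32-bit payload words per packet

def tx_throughput(m_size_bits, q_size, t_tx_int):
    """Theoretic maximum Tx throughput in bits/s, formulas (7) and (8).

    t_tx_int is the interrupt service time Ttx-int(Qsize) in clock cycles.
    """
    n_pck = m_size_bits / BITS_PER_PACKET
    t_tx_msg = 5 * n_pck                  # Ttx-msg = Dsend(Npck) cycles
    if t_tx_int > q_size * t_tx_msg:
        # Short messages: the ISR limits the rate, formula (7)
        bits_per_cycle = m_size_bits * q_size / t_tx_int
    else:
        # Long messages: the Tx-interface limits the rate, formula (8)
        bits_per_cycle = m_size_bits / t_tx_msg
    return bits_per_cycle * F_CLK
```

In the long-message case the result is 64 bits per 5 cycles, i.e. 1.28 Gbit/s at 100 MHz, which matches the saturation throughput reported in subsection III.C.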
  2) The throughput with the receive-request queue

    If Trx-int (Qsize) = Qsize × Trx-msg, where parameter Trx-msg = Dreceive(Npck) is the receiving time of a message as defined in subsection III.A, the MSIQ Rx-master is able to receive the next Qsize messages without stopping the receiving while the processor is running the ISR. This is because each interrupt services Qsize receive-requests while the MSIQ Rx-master receives the next Qsize messages. The MSIQ Rx-master generates a new interrupt service request after receiving a message, but these interrupt service requests are masked if the processor is running the ISR. The analysis divides again into two separate cases, where either Trx-int (Qsize) > Qsize × Trx-msg or Trx-int (Qsize) ≤ Qsize × Trx-msg, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Rx-master.

    In the case of shorter messages the interrupt service time is longer than the receiving time of Qsize messages and Trx-int (Qsize) > Qsize × Trx-msg. In this case the HW receive-request queue is full most of the time and the MSIQ Rx-master must stop receiving until the Rx-ISR's Rx-loop iterations read receive-requests from the HW receive-request queue. The interrupt service time Trx-int (n) clearly determines the length of the periods and Tperiod (n) = Trx-int (n). Because at most Qsize receive-requests can be read from the HW receive-request queue and Qsize messages can be received during the periods, the theoretic maximum throughput is achieved with value n = Qsize and Tperiod (Qsize) = Trx-int (Qsize). Hence, the theoretic maximum throughput is

Rbit (Qsize) = Msize × Qsize / Trx-int (Qsize).                          (9)

    In the case of longer messages the interrupt service time can be shorter than the receiving time of Qsize messages and Trx-int (Qsize) ≤ Qsize × Trx-msg. Because the processor can service the receive-requests of Qsize messages in a shorter time than the MSIQ Rx-master can receive the next Qsize messages, the receiving can continue without stops and the receive-request queue can never become full. Finally, if the message size is increased further, the Rx-loop is executed only once during every execution of the Rx-ISR and Trx-int (1) ≤ Trx-msg. Hence, if Trx-int (Qsize) ≤ Qsize × Trx-msg, the message size determines the number of received messages n during the periods and the length of the period Tperiod (n) = n × Trx-msg, where n = 1, …, Qsize. Thus, the theoretic maximum message rate is Rmsg (n) = n / (n × Trx-msg) = 1 / Trx-msg and the theoretic maximum throughput is

Rbit (n) = Msize × Rmsg (n) = Msize / Trx-msg.                           (10)

C. Comparison of performances and costs

    The performances of the MSIQ and the MSI are presented in Fig. 2, where the horizontal axis shows the message size in 32-bit words and the vertical axis shows the throughput in Gbit/s. The throughputs were computed with a 100 MHz clock. The throughputs of the basic MSI, which does not have the queues, are presented with lines Q1(300) and Q1(600). These lines are computed like in [13] with interrupt service times (Ttx-int, Trx-int) of 300 and 600 clock cycles. The throughputs of the MSIQ with queues of four send-requests and receive-requests are presented with lines Q4(450) and Q4(900). These lines are computed with equal Tx-loop and Rx-loop execution times (Ttx-loop, Trx-loop) of 450 and 900 clock cycles, and with ISR start times (Ttx-start, Trx-start) of 20 clock cycles. The throughputs of the MSIQ with queues of eight requests are not presented, since they are quite similar to those of Q4(450) and Q4(900). This is because the total execution times of the loops dominate the total interrupt service times as the number of loop iterations increases, which reduces the effect of the other delay parameters. The threshold message sizes of Q4(450) and Q4(900) are 199 and 379 words, respectively. With the threshold message sizes Ttx-int (Qsize) = Qsize × Ttx-msg = Qsize × Dsend(Npck) and Trx-int (Qsize) = Qsize × Trx-msg = Qsize × Dreceive(Npck). Thus, with a 100 MHz clock the throughputs of the MSIQ in fact saturate to 1.28 Gbit/s with smaller messages than Fig. 2 presents. Formulas (7) and (9) are used for computing the throughputs of the MSIQ for message sizes that are smaller than the threshold values, and formulas (8) and (10) are used for message sizes that are greater than or equal to the thresholds.

    By comparing line Q1(300) to line Q4(450) and line Q1(600) to line Q4(900) it can be concluded that with messages smaller than 64 and 128 words the theoretic maximum throughputs of the basic MSI and the MSIQ are quite similar. However, the throughputs Q4(450) and Q4(900) of the MSIQ grow much faster as the message size is increased, and they saturate to 1.28 Gbit/s already at 256 and 512 words. Furthermore, the results in Fig. 2 do not show the performance with message bursts. Because traffic usually contains also bursts of messages, it is necessary that the NI is able to achieve a high peak performance for short time intervals under burst traffic. This can be achieved with the HW send-request and HW receive-request queues. For example, with queues of eight requests the MSIQ masters are able to send and receive bursts of eight messages at the maximum rate without stopping their operation.
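As a cross-check of the comparison above, the threshold message sizes and the saturation throughput of the Q4 lines can be recomputed from the delay parameters. The composition Ttx-int(Qsize) = Tres + Ttx-start + Qsize × Ttx-loop + Trec is our reading of formulas (2)-(4); with it the sketch reproduces the reported thresholds of 199 and 379 words and the 1.28 Gbit/s saturation value:

```python
import math

T_RES, T_REC = 105, 62      # NIOS II/f interrupt response/recovery cycles [10]
T_START = 20                # ISR start time used for the Q4 lines
F_CLK = 100e6               # 100 MHz clock
WORD_BITS, PACKET_WORDS = 32, 2

def threshold_words(t_loop, q_size=4):
    # Smallest message size (in 32-bit words) for which
    # Ttx-int(Qsize) <= Qsize * Ttx-msg, with Ttx-msg = 5 cycles per packet.
    t_int = T_RES + T_START + q_size * t_loop + T_REC
    packets = t_int / (q_size * 5)
    return math.ceil(packets * PACKET_WORDS)

# Above the threshold the throughput saturates to formula (8):
# 64 payload bits every 5 clock cycles.
saturation = WORD_BITS * PACKET_WORDS / 5 * F_CLK   # bits/s
```

Evaluating threshold_words(450) and threshold_words(900) gives 199 and 379 words, matching the values stated for Q4(450) and Q4(900), and the saturation value is 1.28 Gbit/s.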
  Figure 2. Theoretic maximum throughput of the MSI and the MSIQ.
    The synthesis results are in Table II. The MSIQs and the MSI contain Tx-FIFOs and Rx-FIFOs of four packets. The logic and register consumption of the MSIQs and the MSI is quite similar, but the amount of block memory bits grows clearly as the size of the queues is increased. The maximum size of the queues is 16 requests. With queues of that size the MSIQ would consume 4096 block memory bits, but it would also provide better theoretic maximum throughput and burst tolerance. Additionally, it would be possible to use a smaller HW send-request queue so as to reduce the HW costs, because the SW send-request queue can store a large number of send-requests in any case. For example, with a HW send-request queue of four requests and a HW receive-request queue of 16 requests the MSIQ would consume the same 2560 block memory bits.

 TABLE II.       RESOURCE CONSUMPTIONS IN STRATIX III EP3SL150 [15]
                 (relative increase over the MSI in parentheses)

 FPGA resource        | MSI (Qsize = 1) | MSIQ (Qsize = 4) | MSIQ (Qsize = 8)
 Combinational ALUTs  | 1550            | 1665 (7.4%)      | 1695 (9.3%)
 Memory ALUTs         | 0               | 0 (0.0%)         | 0 (0.0%)
 Logic registers      | 1454            | 1609 (10.6%)     | 1609 (10.6%)
 Block memory bits    | 1024            | 1792 (75.0%)     | 2560 (150.0%)
                         IV.       CONCLUSIONS

    This paper presents the MSIQ NI, where a new queue mechanism is used for batching interrupts in order to improve the performance. Interrupts generated by the NIs produce a lot of SW overhead, and the performance can be improved by reducing the interrupt frequency. This is achieved by the send-request and receive-request queues, which make it possible to batch interrupt service requests so that individual ISR executions can serve multiple interrupt requests. The throughput improves especially with longer messages. Furthermore, the tolerance of bursts of short messages improves. In addition to the interrupt batching, this is also partly owing to the fact that the request queues allow the MSIQ HW to continue sending and receiving messages while the processor is running the ISR. Hence, the new queue mechanism enables more efficient concurrent operation of the MSIQ HW and SW. The results of the performance analysis and the logic synthesis also show clearly that the performance can be improved with tolerable costs. It would also be possible to reduce the HW costs by using smaller send-request queues in the MSIQ without reducing the performance significantly.

                         ACKNOWLEDGMENT

    This research is funded by the Academy of Finland under grant 122361.

                          REFERENCES

[1]  Z.D. Dittia, G.M. Parulkar, and J.R. Cox, "The APIC Approach to High Performance Interface Design: Protected DMA and Other Techniques," Proc. of the IEEE International Conference on Computer Communications, Kobe, Japan, Apr. 7-12, 1997, pp. 823-831.
[2]  A.F. Diaz, J. Ortega, A. Canas, F.J. Fernandez, M. Anguita, and A. Prieto, "The Lightweight Protocol CLIC on Gigabit Ethernet," Proc. of the International Parallel and Distributed Processing Symposium, Nice, France, Apr. 22-26, 2003, 8 pp.
[3]  P. Gilfeather and A.B. Maccabe, "Modeling Protocol Offload for Message-Oriented Communication," Proc. of the IEEE International Conference on Cluster Computing, Burlington, Massachusetts, USA, Sept. 27-30, 2005, pp. 1-10.
[4]  S.A. AlQahtani, "Performance Evaluation of Interrupt Handling Schemes in Gigabit Networks," Proc. of the IEEE International Conference on Computer and Information Technology, Aizu-Wakamatsu, Fukushima, Japan, Oct. 16-19, 2007, pp. 497-502.
[5]  B. Goglin and N. Furmento, "Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet," Proc. of the IEEE International Conference on Cluster Computing, New Orleans, Louisiana, USA, Aug. 31-Sept. 4, 2009, pp. 1-9.
[6]  J. Mogul and K.K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-Driven Kernel," ACM Transactions on Computer Systems, Vol. 15, No. 3, Aug. 1997, pp. 217-252.
[7]  K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal, "Integrating Polling, Interrupts, and Thread Management," Proc. of the Frontiers of Massively Parallel Computing Symposium, Annapolis, MD, USA, Oct. 27-31, 1996, pp. 13-22.
[8]  H. Kariniemi and J. Nurmi, "Micronmesh for Fault-tolerant GALS Multiprocessors on FPGA," Proc. of the International Symposium on System-on-Chip, Tampere, Finland, Nov. 4-6, 2008, pp. 1-8.
[9]  H. Kariniemi and J. Nurmi, "Fault-Tolerant Communication over Micronmesh NoC with Micron Message-Passing Protocol," Proc. of the 11th International Symposium on System-on-Chip, Tampere, Finland, Oct. 5-7, 2009, pp. 5-12.
[10] Altera Corp., NIOS II Software Developer's Handbook, March 2009. Website, <http://www.pldworld.com/_Semiconductors/Altera/one_click_niosII_docs_9_0/files/n2sw_nii5v2.pdf>, accessed 20.08.2010.
[11] J. Labrosse, MicroC/OS-II: The Real-Time Kernel, Second ed., CMP Books, San Francisco, USA, 2002.
[12] H. Kariniemi and J. Nurmi, "NoC Interface for Fault-Tolerant Message-Passing Communication on Multiprocessor SoC Platform," Proc. of the NORCHIP Conference, Trondheim, Norway, Nov. 2009.
[13] Altera Corp., NIOS II Processor Reference Handbook, November 2009. Website, <http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf>, accessed 20.08.2010.
[14] Altera Corp., Quartus II Handbook v10.0, Ch. 2: System Interconnect Fabric for Memory-Mapped Interfaces, July 2010. Website, <http://www.altera.com/literature/hb/qts/qts_qii54003.pdf>, accessed 20.08.2010.
[15] Altera Corp., Stratix III Device Handbook, Volume 1, San Jose, USA, July 2010. Website, <http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf>, accessed 20.08.2010.
 
2004 qof is_mpls_ospf
2004 qof is_mpls_ospf2004 qof is_mpls_ospf
2004 qof is_mpls_ospf
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
50120140505008
5012014050500850120140505008
50120140505008
 
Switching and multicast schemes in asynchronous transfer mode networks
Switching and multicast schemes in asynchronous transfer mode networksSwitching and multicast schemes in asynchronous transfer mode networks
Switching and multicast schemes in asynchronous transfer mode networks
 
Blade
BladeBlade
Blade
 
Implementation of High Speed OFDM Transceiver using FPGA
Implementation of High Speed OFDM Transceiver using FPGAImplementation of High Speed OFDM Transceiver using FPGA
Implementation of High Speed OFDM Transceiver using FPGA
 
Analysis of Women Harassment inVillages Using CETD Matrix Modal
Analysis of Women Harassment inVillages Using CETD Matrix ModalAnalysis of Women Harassment inVillages Using CETD Matrix Modal
Analysis of Women Harassment inVillages Using CETD Matrix Modal
 

More from srimoorthi (20)

87
8787
87
 
84
8484
84
 
83
8383
83
 
82
8282
82
 
75
7575
75
 
72
7272
72
 
70
7070
70
 
69
6969
69
 
68
6868
68
 
63
6363
63
 
60
6060
60
 
59
5959
59
 
57
5757
57
 
56
5656
56
 
50
5050
50
 
51
5151
51
 
45
4545
45
 
44
4444
44
 
43
4343
43
 
42
4242
42
 

61

High-Performance NoC Interface with Interrupt Batching for Micronmesh MPSoC Prototype Platform on FPGA

Heikki Kariniemi and Jari Nurmi
Department of Computer Systems
Tampere University of Technology
Tampere, Finland
Email: {heikki.kariniemi, jari.nurmi}@tut.fi

Abstract—This paper presents a new NoC Interface (NI) targeted for improving the performance of the Micronmesh Multiprocessor System-on-Chip (MPSoC). The previous version of the NI, called the Micronswitch Interface (MSI), can zero-copy messages as it sends and receives them. It also offloads some functionalities of the communication protocol from software (SW) to hardware (HW), but interrupt processing produces extra SW overhead and reduces the performance. For this reason, an improved version of the MSI called MSI-with-Queues (MSIQ) was designed with a new queue mechanism in order to reduce the frequency of interrupts and the SW overhead. Owing to the new queue mechanism of the MSIQ, it is possible to batch and service multiple interrupt service requests by every execution of the Interrupt Service Routine (ISR). Additionally, the new MSIQ HW is able to send and receive messages while the processor is running the ISR. The performance of the MSIQ is also analyzed in this paper. The results show that the queue mechanism improves the performance with moderate hardware costs.

I. INTRODUCTION

In computer systems where computers are connected by high-speed networks, the operation of the network interfaces may become a main obstacle for the communication throughput and the performance. This is because the communication between the CPUs and the network interfaces produces extra software overhead. Several methods, for example zero-copying, protocol offloading, jumbo frames, message fragmentation, and interrupt coalescing, have been presented in the literature [1, 2, 3, 4, 5, 6, 7] for eliminating this problem. Due to certain similarities of the architectures, these same methods can be used for solving the same problem in MPSoCs where distributed-memory and message-passing communication architectures are used.

In the Micronmesh MPSoC platform [8], the tightly coupled operation of the Micron Message-Passing (MMP) protocol [9] and the MSIQ enables direct message transfers between the local variables of the user threads and the MSIQ, which is a technique called zero-copying in the literature [1, 2, 3, 5]. The zero-copying reduces communication latency and improves the performance, because it eliminates copying of messages from user memory to the MSIQ through intermediate buffers in the kernel memory. The multiplexing and demultiplexing functions of the MMP protocol are also offloaded to the MSIQ HW in order to reduce software overhead. Protocol offloading is used for speeding up the protocol functions by HW and for reducing the software overhead [1, 2, 3, 4, 5].

Interrupt-driven systems provide low latency and low SW overhead if the interrupt rate is low, but the performance degrades as the interrupt frequency grows. Interrupts produce additional SW overhead by causing context switching from user mode to kernel mode before the execution of the ISR and back to user mode after the execution of the ISR is finished [1, 2, 3, 4, 5, 6, 7, 10, 11]. The last three methods mentioned above are used for reducing the software overhead produced by the interrupt processing and the processor utilization. The usage of jumbo frames, i.e. large messages, makes it possible to reduce the message rate and the interrupt frequency [2, 3, 5]. Fragmentation is related to jumbo frames, which are usually fragmented into smaller frames before sending [1, 3, 5]. The MSIQ HW also fragments the messages into small fixed-sized packets as it sends them to the Micronmesh NoC and assembles the received messages from the received packets.

Interrupt coalescing [2, 3, 5] is a technique used for batching interrupt service requests so that every execution of the ISR can serve several requests, which reduces the interrupt frequency and the software overhead. It also has variants called Interrupt Multiplexing [1] and the Enabling-Disabling (ED) technique [4]. In a typical implementation the interrupts are delayed until a certain number of interrupts has been batched or a timeout expires. The implementation used in the new MSIQ works slightly differently. When receiving messages, the MSIQ generates an interrupt immediately after it has received a new message. If more messages arrive or have arrived in bursts during the execution of the ISR, they are also served. This method provides low latency and a good tolerance against bursts of short messages in addition to the reduced interrupt frequency. When sending messages, the MSIQ sends several messages successively in batches. It generates the interrupts after finishing the sending of the first message of the batch, which makes it possible to start running the ISR while the sending still continues. As a consequence, the ISR can also run concurrently with the MSIQ HW, which improves the performance further.

In the MSIQ the interrupt coalescing is implemented with send-request and receive-request queues. The results of the performance analysis and the logic synthesis presented in this paper show that the improved performance is achieved with small additional HW costs compared to the old MSI [12]. The MSIQ could also be used with polling, but polling is usually used together with interrupts and is more difficult to implement [6, 7]. Furthermore, the length of the polling period must be carefully adapted to the message rate in order to achieve good performance: if it is too long, the communication latency grows, and if it is too short, the software overhead grows.

This paper is organized as follows. Section II presents the architecture and the operation of the new MSIQ. Section III presents the performance analysis and the HW costs of the new MSIQ, and finally, Section IV concludes this paper.

II. MICRONSWITCH INTERFACE WITH QUEUES

The Micronmesh MPSoC platforms [8] consist of Micronmesh nodes that contain a local NIOS II processor [13], local on-chip memories, a timer, a local Avalon system bus [14], the MSIQ, and the Micronswitch [8]. The NIOS II processors run distinct MicroC/OS II real-time kernels [11] in every Micronmesh node. The MSIQs connect the Micronmesh nodes to the Micronmesh NoC through the local Micronswitches.

This research is funded by the Academy of Finland under grant 122361. 978-1-4244-8971-8/10/$26.00 © 2010 IEEE
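The receive-side batching policy described above (interrupt immediately on the first message, then keep servicing requests that arrive while the ISR runs) can be contrasted with one-interrupt-per-message operation in a minimal discrete-event sketch. The function name, the model, and the timing constants in the example call are illustrative assumptions, not the paper's measured values.

```python
def count_isrs(arrivals, t_start, t_loop, batching=True):
    """Count ISR executions needed to service a sequence of requests.

    arrivals: times (in clock cycles) at which requests are queued.
    t_start:  ISR entry overhead before the service loop.
    t_loop:   service time of one request inside the ISR loop.
    With batching (MSIQ-style queues), one ISR execution keeps looping
    while requests are pending, so bursts arriving during the ISR are
    absorbed; without batching, every request costs one execution.
    """
    arrivals = sorted(arrivals)
    if not batching:
        return len(arrivals)
    isrs, i, now = 0, 0, 0
    while i < len(arrivals):
        now = max(now, arrivals[i])  # idle until the next request
        now += t_start               # ISR entry overhead
        isrs += 1
        # Service every request that has already arrived, including
        # ones that arrive while earlier loop iterations are running.
        while i < len(arrivals) and arrivals[i] <= now:
            now += t_loop
            i += 1
    return isrs
```

For a burst of four requests arriving within a few cycles of each other, the batched model executes the ISR once, while the unbatched model executes it four times; widely spaced arrivals still get one ISR each, which is the low-latency behaviour the MSIQ aims for.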
A. The Architecture of the MSIQ

The MSIQ consists of three main sub-blocks: the MSIQ Rx-master, the MSIQ Tx-master, and the MSIQ Slave. It is depicted at the bottom of the schematic in Fig. 1. The MSIQ Rx-master on the left receives messages from the NoC, the MSIQ Tx-master on the right sends messages to the NoC, and the MSIQ Slave in the middle is used for controlling and configuring the operations of the MSIQ Masters through the MSIQ's register interface. The MSIQ Slave is also responsible for generating interrupt service requests according to the MSIQ Masters' status.

Figure 1. The architecture of the MSIQ.

The MSIQ's register interface is partly presented in Table I. It contains a status register MSIQ-status, which is a combined status of the MSIQ Masters. The values of the Tx-control, Tx-base-address, Tx-routing-header, and Tx-protocol-control-header registers form send-requests that are stored to the HW send-request queue of the MSIQ HW (HW SEND-REQUEST QUEUE). The writing of these registers starts the sending of one message. Respectively, the values of the Rx-routing-header and Rx-protocol-control-header registers form the receive-requests that are stored into the HW receive-request queue of the MSIQ HW (HW RECEIVE-REQUEST QUEUE). The reading of these registers ends the receiving of one message. The MSIQ Slave also contains four FIFOs for storing the send-requests and two FIFOs for storing the receive-requests, as Table I explains.

TABLE I. MSIQ'S REGISTER INTERFACE AND QUEUES

MSIQ-status: The common status register of the MSIQ Masters.
Rx-control: The control register used for controlling the MSIQ Rx-master's operation.
Rx-base-address: The base-address of the Rx-buffer table.
Rx-routing-header: The Rx-routing header of the last packet of the received message. This register is part of the receive-request queue and it is the output of the Rx-routing-header-FIFO.
Rx-protocol-control-header: The Rx-protocol-control header of the last packet of the received message. This register is part of the receive-request queue and it is the output of the Rx-protocol-control-header-FIFO.
Tx-control: The Tx-control register used for controlling the MSIQ Tx-master's operation. This register is part of the send-request queue and it is the input of the Tx-control-FIFO.
Tx-base-address: The start address of the message stored into the Tx-buffer. This register is part of the send-request queue and it is the input of the Tx-base-address-FIFO.
Tx-routing-header: The Tx-routing header template of the packets of the message to be sent. This register is part of the send-request queue and it is the input of the Tx-routing-header-FIFO.
Tx-protocol-control-header: The Tx-protocol-control header template of the packets of the message to be sent. This register is part of the send-request queue and it is the input of the Tx-protocol-control-header-FIFO.

The MSIQ Tx-master starts sending messages as it receives send-requests through the HW send-request queue from the MSIQ Slave. The MSIQ Tx-master's Avalon interface (AVA-TX-IF) reads the messages directly from the Tx-buffers in the local memory (LOCAL MEMORY), fragments the messages, generates packets of the fragments, and writes the packets to the Tx-FIFO, from which the MSIQ Tx-master's Tx-interface (TX-IF) sends them to the Micronswitch. Packets consist of two headers and two payload words [9, 12]. The addresses of the messages are passed to the MSIQ Tx-master's Avalon interface through the Tx-base-address-FIFO. In Fig. 1 this address points to the beginning of the Tx-buffer A of thread B, which is illustrated by arrow A. The routing headers and the protocol control headers of the packets are stored into the Tx-routing-header-FIFO and the Tx-protocol-control-header-FIFO. The control register values are passed through the Tx-control-FIFO. After finishing the sending of a message, the MSIQ Tx-master changes its status in order to make the MSIQ Slave generate an interrupt service request, reads the next send-request from the HW send-request queue, and continues sending messages until the HW send-request queue becomes empty. It can continue sending while the processor is running the ISR. The maximum size of the message batches depends on the size of the HW send-request queue: the larger the HW send-request queue, the more messages can be sent without interrupts. If only one message could be sent at a time, the execution time of the interrupts would dominate the total sending time, especially if the messages were short [12]. Hence, owing to the HW send-request queues, it is possible to reduce the interrupt frequency and improve the performance.

The MSIQ Rx-master's Rx-interface (RX-IF) receives packets from the Micronswitch and writes them to the Rx-FIFO. The MSIQ Rx-master's Avalon interface (AVA-RX-IF) reads the packets from the Rx-FIFO and writes the packet payloads to the Rx-buffers, which are in the local memory.
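The way one complete send-request is filed by writing the four Tx registers of Table I (one value into each FIFO) can be modelled behaviourally as below. The class and method names are illustrative assumptions; the four-entry depth follows the four-request configuration analyzed later in the paper.

```python
from collections import deque

class SendRequestQueue:
    """Behavioural sketch of the HW send-request queue of the MSIQ
    Slave: writing the Tx-control, Tx-base-address, Tx-routing-header,
    and Tx-protocol-control-header registers files one send-request."""

    def __init__(self, depth=4):
        self.fifo = deque()
        self.depth = depth

    def is_full(self):
        return len(self.fifo) >= self.depth

    def write_registers(self, tx_control, tx_base_address,
                        tx_routing_header, tx_protocol_control_header):
        # Each register value is the input of its corresponding FIFO;
        # together they form one entry of the send-request queue.
        if self.is_full():
            raise RuntimeError("HW send-request queue full")
        self.fifo.append((tx_control, tx_base_address,
                          tx_routing_header, tx_protocol_control_header))

    def read_next(self):
        # The MSIQ Tx-master pops send-requests until the queue empties.
        return self.fifo.popleft() if self.fifo else None
```

In the real device the "queue full" condition is of course signalled through the MSIQ-status register rather than an exception; the exception here only marks the point where the driver must defer the request to its SW send-request queue.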
It obtains the Rx-buffer addresses from the Rx-buffer table (RX-BUFFER TABLE) in the local memory, which is referred to by the Rx-base-address register, as arrow B in Fig. 1 illustrates, and computes the storage addresses of the packet payloads. When doing this, the MSIQ Rx-master's Avalon interface demultiplexes and assembles the messages of different Rx-channels from one input packet stream through the Rx-FIFO to multiple Rx-buffers. The Channel Identifiers (CID) of the protocol control headers are used for addressing the Rx-buffer table elements, as arrow C illustrates. The Rx-buffer table elements contain the Rx-buffer addresses, as arrow D illustrates. They are used by the MSIQ Rx-master for addressing the Rx-buffers, as arrow E illustrates. After finishing the receiving of a message, the MSIQ Rx-master's Avalon interface writes the receive-request to the HW receive-request queue and changes its status in order to make the MSIQ Slave generate an interrupt service request. If the HW receive-request queue is not full, the receiving can be continued while the processor is running the ISR. Since every execution of the ISR can service multiple receive-requests, the number of interrupts can be reduced. This happens especially if the messages are short and several messages arrive in bursts between consecutive executions of the ISR. Furthermore, the performance also improves because the receiving needs to be stopped less frequently.

B. The MSIQ device driver and the MMP protocol

The main parts of the MSIQ device driver (MSIQ SW) are a state data structure, send (msiq_send) and receive (msiq_receive) functions, and the ISR (msiq_isr). The MSIQ SW is used by the MMP protocol's functions for controlling the operations of the MSIQ. The MMP protocol is a messaging-layer protocol which forms an Application Programming Interface (API) for programming fault-tolerant message-passing applications [9]. This API contains, for example, functions for sending (mmpp_send) and receiving (mmpp_receive) messages. The MSIQ SW's state data structure also contains a SW send-request queue and a Tx-serviced queue. In the SW send-request queue the send-requests are pointers to the data structures of the MMP protocol's channels [9], which contain the register values of the send-requests to be stored into the HW send-request queue. The elements of the Tx-serviced queue are pointers to the Tx-channels' signaling semaphores.

C. Sending of messages

The messages are sent in the following way.

1. A thread calls the mmpp_send function, which calls the msiq_send function of the MSIQ SW.

2. The msiq_send function first puts the address of the Tx-channel's data structure to the SW send-request queue. Then it reads the status of the MSIQ. If the MSIQ Tx-master is idle, it reads the send-request from the SW send-request queue, stores the address of the Tx-channel's signaling semaphore to the Tx-serviced queue, and writes the send-request to the HW send-request queue. This enables the MSIQ Tx-master to send, and the operation continues in step three. If the MSIQ Tx-master is not idle, msiq_send lets the ISR (msiq_isr) of the MSIQ device driver initialize the sending of the next message as the processor starts running it in step four after the previous send is finished, and returns. The accessing of the MSIQ SW's state data structure and the MSIQ's register interface is controlled by a semaphore so that they can be accessed only by one thread at a time or by the msiq_isr. Additionally, because the msiq_isr also has a higher priority than the threads, it can be guaranteed that the MSIQ SW's data structures and queues are maintained correctly.

3. The MSIQ Tx-master's Avalon interface reads the send-request from the HW send-request queue, starts reading a message from the Tx-buffer, slices it into packet payloads, generates both of the headers for every packet, and writes complete packets to the Tx-FIFO. The MSIQ Tx-master's Tx-interface reads packets from the Tx-FIFO and sends them to the Micronswitch. After the sending of the message is finished, the MSIQ Tx-master's Avalon interface changes its status, and the MSIQ Slave generates an interrupt service request accordingly, which starts the execution of the MSIQ ISR in step four. If the HW send-request queue is not empty yet, the MSIQ Tx-master's Avalon interface reads the next send-request from it and continues sending messages until the queue is empty while the processor is running the msiq_isr (ISR) in step four.

4. The processor starts running the msiq_isr (ISR). The msiq_isr acknowledges the interrupt service request, reads the address of the signaling semaphore from the Tx-serviced queue, and posts the signaling semaphore to the thread which called the mmpp_send function. This wakes up the thread, and the mmpp_send function returns. If the SW send-request queue is not empty, the msiq_isr reads the next send-request from it, stores the address of the Tx-channel's signaling semaphore to the Tx-serviced queue, and writes the next send-request to the HW send-request queue, which enables the sending and the interrupts again. These operations are repeated in a loop until all of the signaling semaphores of the serviced send-requests have been posted from the Tx-serviced queue and either the HW send-request queue is full or the SW send-request queue is empty.

As steps three and four show, the HW send-request queue enables the interrupt batching. Additionally, the Tx-buffers are mapped to the local variables of the threads, and the MSIQ HW uses DMA (Direct Memory Access) transfers for zero-copying the messages directly from the Tx-buffers. The MSIQ also slices the messages into packets as it multiplexes and sends them in one packet stream to the Micronmesh NoC, which implements message fragmentation.

D. Receiving of messages

The messages are received in the following way.

1. A thread calls the mmpp_receive function, which prepares the Rx-channel for receiving by deasserting the lock bit and by updating the address field of the Rx-channel's Rx-buffer table element. Then it calls the msiq_receive function of the MSIQ SW, which enables the MSIQ Rx-master to receive messages.

2. The MSIQ Rx-master's Rx-interface receives packets from the Micronswitch and writes them to the Rx-FIFO. The Rx-master's Avalon interface reads the packets from the Rx-FIFO one by one, computes the addresses of the Rx-buffer table elements by adding the packets' CIDs multiplied by four to the Rx-base-address register's value, and reads the Rx-buffer table elements from the local memory. Then it multiplies the packets' sequence numbers carried in the protocol control headers by eight and the address field of the Rx-buffer table element by four. The sums of these two products are the storage addresses of the packet payloads. These multiplications are performed by simple shift-left operations. After computing the storage addresses, the MSIQ Rx-master writes the packet payloads to the Rx-buffers. If successive packets have the same CID, the Rx-master can reuse the Rx-buffer table element, and only the storage address must be computed again for each of the packets separately. Otherwise, the Rx-buffer table elements must be read from the memory. After the last packet of the message is received, the MSIQ Rx-master's Avalon interface asserts the lock bit, updates the address field of the Rx-buffer table element to point to the end of the message, writes the Rx-buffer table element to the memory, writes the receive-request to the HW receive-request queue, and changes its status in order to make the MSIQ Slave generate an interrupt service request. Then it continues receiving messages until the HW receive-request queue is full while the msiq_isr (ISR) is executed in step three.

3. The processor starts running the msiq_isr (ISR) function. The msiq_isr acknowledges the MSIQ Rx-master's interrupt service request, reads the receive-request from the HW receive-request queue, obtains the address of the Rx-channel's data structure by the CID from the MSIQ SW's data structure, and posts the Rx-channel's signaling semaphore to the thread that called the mmpp_receive function.
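The shift-based address arithmetic of step 2 of receiving can be written out as below. The function name and the dict-based stand-in for the local memory are illustrative assumptions; the two shift-adds follow the description in the text.

```python
def rx_payload_address(rx_base_address, local_memory, cid, seq_number):
    """Storage address of one packet payload, as in step 2 of receiving:
    the Rx-buffer table element sits at Rx-base-address + CID * 4, and
    the payload address is (address field) * 4 + (sequence number) * 8.
    Both multiplications are simple shift-left operations, as in the
    MSIQ HW."""
    element_address = rx_base_address + (cid << 2)  # CID * 4
    address_field = local_memory[element_address]   # read Rx-buffer table element
    return (address_field << 2) + (seq_number << 3)  # field * 4 + seq * 8
```

The table read (the dictionary lookup here) is the part that can be skipped for successive packets with the same CID: the address field is reused and only the final shift-add is recomputed per packet.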
These operations are repeated in a loop until the HW receive-request queue is empty or a certain maximum number of receive-requests has been serviced. Hence, the HW receive-request queue of the MSIQ can be used for batching the interrupts. The MSIQ Rx-master's Avalon interface also partly offloads the MMP protocol's functions by using the Rx-buffer table for demultiplexing interleaved packets of different channels from a single input packet stream according to the CIDs. Furthermore, because the Rx-channels' Rx-buffers are mapped to the local variables of the threads [9], it can use DMA for zero-copying and assembling the messages to the Rx-buffers.

III. PERFORMANCE ANALYSIS

A theoretic approach is used for estimating the performances of the MSI and the MSIQ. This is because several factors, for example the operation speed of the memories, the size of the cache memories, and the operation delay of the interrupt logic, affect the performance, and measurements with only one configuration would not produce reliable estimates. However, the execution time of the ISR was measured for the calculations with a simple platform where the MSIQ Masters were connected to different ports of a dual-port on-chip SRAM which contained the buffers. Furthermore, the program code and data were stored in a different single-port on-chip SRAM. The performance analysis is targeted at comparing the operations, the costs, and the performances of the new MSIQ and the MSI.

The theoretic maximum throughputs with messages of different sizes represent the peak communication performances achievable when as many messages as possible are sent or received continuously. In the first step of the analysis the performance of the MSIQ HW is analyzed. The result of the first step is used for simplifying the second step of the performance analysis, where the performance of the MSIQ HW and the MSIQ SW is analyzed together.

A. The performance of the MSIQ HW

As messages are sent, the MSIQ Tx-master's Avalon interface reads packet payloads of two words from the Tx-buffers, generates packets, and stores the packets to the Tx-FIFO. After storing the last packet of the message to the Tx-FIFO, it changes its status in order to make the MSIQ Slave generate an interrupt. The latency of reading the payloads of Npck packets is Dread(Npck) = Npck × 4 + 2 clock cycles. This includes the time required for generating and storing Npck packets to the Tx-FIFO. The latency of sending Npck packets from the Tx-FIFO to the Micronswitch is Dsend(Npck) = Npck × 5 clock cycles, respectively. Since Dread(Npck) ≤ Dsend(Npck) when Npck ≥ 2, it can be concluded that the MSIQ Tx-master's Tx-interface limits the throughput.

The MSIQ Rx-master's Avalon interface reads packets from the Rx-FIFO, reads the Rx-buffer table elements and computes the storage addresses, and writes the packet payloads to the Rx-buffers. After the last packet of a message it changes its status in order to make the MSIQ Slave generate an interrupt. The latency of writing the payloads of Npck packets to the Rx-buffer is Dwrite(Npck) = 2 + Npck × 2 + 2 clock cycles. The latency of receiving Npck packets through the Rx-interface of the MSIQ Rx-master (RX-IF) is Dreceive(Npck) = Npck × 5 clock cycles. Since Dwrite(Npck) ≤ Dreceive(Npck) when Npck ≥ 2, it can be concluded that the MSIQ Rx-master's Rx-interface limits the throughput.

As was shown, the Tx-interface and the Rx-interface of the MSIQ Masters limit the throughputs like in the original MSI [8]. Therefore, in order to simplify the performance analysis of the MSIQ HW and SW, it can be assumed that the processing of every packet also takes five clock cycles in both of the Avalon interfaces of the MSIQ Masters and that Dread(Npck) = Dsend(Npck) = Dwrite(Npck) = Dreceive(Npck) = Npck × 5 clock cycles. Owing to this simplification, and because the interfaces operate at the same clock rate, it is no longer necessary to take into consideration the filling of the Tx-FIFO and the emptying of the Rx-FIFO.

B. The performance of the MSIQ SW and HW

In the performance analysis a couple of things must be taken into consideration. Firstly, the length of the messages and the size of the queues Qsize affect the theoretic maximum throughput. Secondly, the MSIQ Masters can receive and send messages while the local processors are running the ISR. Additionally, the ISR (msiq_isr) consists of different Tx-ISR and Rx-ISR branches for servicing interrupts caused by the MSIQ Tx-master and the MSIQ Rx-master, as was described in sections II.C and II.D.

The execution time of the Tx-ISR is

Ttx-isr(n) = Ttx-start + n × Ttx-loop, (1)

where Ttx-start is the time consumed in the beginning of the execution of the ISR before the Tx-loop iterations and where n = 1, …, Qsize is the number of serviced send-requests. Parameter Qsize is also the maximum batch size, and Ttx-loop is the execution time of the Tx-ISR's Tx-loop executed in step four of sending, as described in subsection II.C. The sending of other messages generates new interrupt service requests, but they are masked during the execution of the ISR.

The service time of the Tx-interrupts is

Ttx-int(n) = Tres + Ttx-isr(n) + Trec, (2)

where parameter Tres is the response time between the assertion of the interrupt request and the start of the ISR's execution, and Trec is the interrupt recovery time. If the NIOS II/f (fast) core is used, parameter Tres = 105 clock cycles and parameter Trec = 62 clock cycles [10].

The execution time of the Rx-ISR is

Trx-isr(n) = Trx-start + n × Trx-loop, (3)

where Trx-start is the time consumed in the beginning of the execution of the ISR before the Rx-loop iterations and where n = 1, …, Qsize is the number of the Rx-ISR's Rx-loop iterations, which is limited by the size of the queues Qsize. Parameter Trx-loop is the time consumed by each of the Rx-loop iterations executed in step three of receiving, as described in subsection II.D. The receiving of new messages also generates receive-requests, but the interrupts are masked during the execution of the ISR.

The service time of the Rx-interrupts is

Trx-int(n) = Tres + Trx-isr(n) + Trec, (4)

where parameters n, Tres, and Trec are equal to those of formula (2).

In the performance analysis the operation of the MSIQ HW and SW can be divided into periods during which the MSIQ Masters send or receive a certain number of messages and the ISR is executed once. The length of the periods is denoted by Tperiod(n), where n = 1, …, Qsize is the number of serviced send-requests or receive-requests, i.e. the batch size. The length of the period is determined by the execution time of the interrupt services or the time required for sending or receiving n messages. The value of parameter n is floating, and its value depends also on the message size. The length of the period determines the theoretic maximum message rate

Rmsg(n) = n / Tperiod(n) (5)

and the theoretic maximum bit rate

Rbit(n) = Msize × Rmsg(n) = Msize × n / Tperiod(n), (6)

where n = 1, …, Qsize and parameter Msize is the message size in bits.
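Formulas (1)-(6) are simple enough to evaluate numerically. The sketch below does so in Python (an illustration, not the authors' tooling), using the NIOS II/f response and recovery times quoted above; the function names and the 100 MHz default clock are assumptions taken from the comparison in subsection III.C.

```python
T_RES = 105  # interrupt response time, NIOS II/f, clock cycles [10]
T_REC = 62   # interrupt recovery time, NIOS II/f, clock cycles [10]

def t_isr(n, t_start, t_loop):
    # Formulas (1) and (3): ISR start overhead plus n loop iterations.
    return t_start + n * t_loop

def t_int(n, t_start, t_loop):
    # Formulas (2) and (4): add interrupt response and recovery times.
    return T_RES + t_isr(n, t_start, t_loop) + T_REC

def r_bit(m_size_bits, n, t_period_cycles, f_clk_hz=100e6):
    # Formulas (5) and (6): theoretic maximum bit rate in bits/s, with
    # the period converted from clock cycles to seconds.
    r_msg = n / (t_period_cycles / f_clk_hz)  # messages per second
    return m_size_bits * r_msg
```

For example, a batch of n = 4 requests with the Q4(450) parameters used later (Ttx-start = 20, Ttx-loop = 450 cycles) gives Ttx-isr(4) = 1820 cycles and Ttx-int(4) = 1987 cycles.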
The theoretic maximum bit rate Rbit(n) is the theoretic maximum throughput. Formulas of the theoretic maximum throughputs are derived for sending and receiving separately in the following two subsections.

1) The throughput with the send-request queue

If Ttx-int(Qsize) = Qsize × Ttx-msg, where parameter Ttx-msg = Dsend(Npck) is the sending time of a message as defined in subsection III.A, the MSIQ Tx-master is able to send messages continuously without stopping the sending while the processor is running the Tx-ISR. The HW send-request queue can never be emptied by the MSIQ Tx-master, because the processor runs the Tx-ISR, which puts new send-requests to the HW send-request queue from the SW send-request queue. The MSIQ Tx-master generates interrupts after every sending of a message, but these interrupt service requests are masked if the processor is running the ISR. The performance analysis of the MSIQ Tx-master consists of two separate cases, where either Ttx-int(Qsize) > Qsize × Ttx-msg or Ttx-int(Qsize) ≤ Qsize × Ttx-msg, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Tx-master.

In the case that the messages are shorter, the interrupt service time is longer than the sending time of Qsize messages and Ttx-int(Qsize) > Qsize × Ttx-msg. In this case the HW send-request queue is emptied, and the MSIQ Tx-master must stop sending messages until the Tx-ISR puts the next send-requests into the HW send-request queue. Thus, with shorter messages the interrupt service time Ttx-int(n) determines the length of the period and Tperiod(n) = Ttx-int(n). The message rate is Rmsg(n) = n / Tperiod(n) = n / Ttx-int(n), where n = 1, …, Qsize, and the bit rate is Rbit(n) = Msize × Rmsg(n). The theoretic maximum throughput is achieved with value n = Qsize, when the ISR loads Qsize send-requests to the HW send-request queue, and the theoretic maximum throughput is

Rbit(Qsize) = Msize × Rmsg(Qsize) = Msize × Qsize / Ttx-int(Qsize). (7)

In the case that the messages are longer, the interrupt service time can be smaller than the sending time of the messages and Ttx-int(Qsize) ≤ Qsize × Ttx-msg. Because the Tx-ISR can put a larger number of send-requests to the HW send-request queue than the MSIQ Tx-master can send during the interrupt service time Ttx-int(Qsize), the HW send-request queue is nonempty most of the time and the sending can continue without stops. Because the number of Tx-loop iterations of the Tx-ISR depends on the message size, which determines the sending time, parameter n can also be smaller than Qsize. Hence, the sending time of the messages determines the length of the period Tperiod(n) = n × Ttx-msg, where n = 1, …, Qsize, and the theoretic maximum message rate is Rmsg(n) = n / Tperiod(n) = n / (n × Ttx-msg) = 1 / Ttx-msg. Thus, the theoretic maximum throughput is

Rbit(n) = Msize × Rmsg(n) = Msize / Ttx-msg. (8)

2) The throughput with the receive-request queue

The performance analysis of the MSIQ Rx-master likewise consists of two separate cases, where either Trx-int(Qsize) > Qsize × Trx-msg or Trx-int(Qsize) ≤ Qsize × Trx-msg, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Rx-master.

In the case that the messages are shorter, the interrupt service time is longer than the receiving time of Qsize messages and Trx-int(Qsize) > Qsize × Trx-msg. In this case the HW receive-request queue is full most of the time, and the MSIQ Rx-master must stop receiving until the Rx-ISR's Rx-loop iterations read receive-requests from the HW receive-request queue. The interrupt service time Trx-int(n) clearly determines the length of the periods and Tperiod(n) = Trx-int(n). Because at most Qsize receive-requests can be read from the HW receive-request queue and Qsize messages can be received during the periods, the theoretic maximum throughput is achieved with value n = Qsize and Tperiod(Qsize) = Trx-int(Qsize). Hence, the theoretic maximum throughput is

Rbit(Qsize) = Msize × Qsize / Trx-int(Qsize). (9)

In the case that the messages are longer, the interrupt service time can be shorter than the receiving time of Qsize messages and Trx-int(Qsize) ≤ Qsize × Trx-msg. Because the processors can service the receive-requests of Qsize messages in a shorter time than the MSIQ Rx-master can receive the next Qsize messages, the receiving can be continued without stops and the receive-request queue can never become full. Finally, if the message size is further increased, the Rx-loop is executed only once during every execution of the Rx-ISR and Trx-int(1) ≤ Trx-msg. Hence, if Trx-int(Qsize) ≤ Qsize × Trx-msg, the message size determines the number of received messages n during the periods and the length of the period Tperiod(n) = n × Trx-msg, where n = 1, …, Qsize. Thus, the theoretic maximum message rate is Rmsg(n) = n / (n × Trx-msg) = 1 / Trx-msg and the theoretic maximum throughput is

Rbit(n) = Msize × Rmsg(n) = Msize / Trx-msg. (10)

C. Comparison of performances and costs

The performances of the MSIQ and the MSI are presented in Fig. 2, where the horizontal axis shows the message size in 32-bit-wide words and the vertical axis shows the throughputs in Gbit/s. The throughputs were computed with a 100 MHz clock. The throughputs of the basic MSI, which does not have the queues, are presented with lines Q1(300) and Q1(600). These lines are computed like in [13] with interrupt service times (Ttx-int, Trx-int) of 300 and 600 clock cycles. The throughputs of the MSIQ with queues of four send-requests and receive-requests are presented with lines Q4(450) and Q4(900). These lines are computed with equal Tx-loop and Rx-loop execution times (Ttx-loop, Trx-loop) of 450 and 900 clock cycles, and with ISR start times (Ttx-start, Trx-start) of 20 clock cycles. The throughputs of the MSIQ with queues of eight requests are not presented, since they are quite similar to those of Q4(450) and Q4(900). This is because the total execution times of the loops dominate the total interrupt service times as the number of loop iterations increases, which reduces the
In this case the theoretic maximum throughput does not effect of the other delay parameters. The threshold message sizes of depend on the value of parameter n and it is Q4(450) and Q4(900) are 199 and 379 words respectively. With the threshold message sizes Ttx-int (Qsize) = Qsize × Ttx-msg = Qsize × Rbit (n) = Msize × Rmsg (n) = Msize / Ttx-msg. (8) Dsend(Npck) and Trx-int (Qsize) = Qsize × Trx-msg = Qsize × Dreceive(Npck). Thus, with 100 MHz clock the throughputs or the MSIQ saturate to 2) The throughput with the receive-request queue 1.28 GBits/s actually with smaller messages than Fig. 2 presents. Formulas (7) and (9) are used for computing the throughputs of the If Trx-int (Qsize) = Qsize × Trx-msg, where parameter Trx-msg = MSIQ for message sizes that are smaller than the threshold values and Dreceive(Npck) is the receiving time of a message as defined in formulas (8) and (10) are used for computing the throughputs with subsection III.A, the MSIQ Rx-master is able to receive the next Qsize message sizes that are higher than or equal to the thresholds. messages without stopping the receiving while the processor is By comparing line Q1(300) to line Q4(450) and line Q1(600) to running the ISR. This is because each interrupt services Qsize receive- line Q4(900) it can be concluded that with messages which are smaller requests while the MSIQ Rx-master receives the next Qsize messages. than 64 and 128 words the theoretic maximum throughputs of the The MSIQ Rx-master generates new interrupt service request after basic MSI and the MSIQ are quite similar. However, the throughputs receiving of messages, but these interrupt service requests are masked Q4(450) and Q4(900) of the MSIQ grow much faster as the message if processor is running the ISR. The analysis divides also into two size is increased and they saturate to 1.28 GBits/s already at the point separate cases, where either Trx-int (Qsize) > Qsize × Trx-msg or Trx-int of 256 and 512 words. 
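The case analysis above can be folded into a small numeric model. The sketch below is an illustration only: the linear ISR model T_int(n) = T_start + n × T_loop and the assumed packet payload of two 32-bit words are not taken from the paper, but together with the five clock cycles per packet from subsection III.A they reproduce the 1.28 GBits/s saturation at a 100 MHz clock.

```python
# Toy model of formulas (5)-(10). The ISR model and the packet payload
# size below are illustrative assumptions, not parameters from the paper.
F_CLK_HZ = 100e6        # clock frequency used for Fig. 2
WORDS_PER_PACKET = 2    # assumed payload per packet (64 bits)
CYCLES_PER_PACKET = 5   # per the simplification of subsection III.A

def max_throughput_bps(msize_words, qsize, t_start, t_loop):
    """Theoretic maximum throughput in bits/s for one direction."""
    npck = -(-msize_words // WORDS_PER_PACKET)   # packets per message (ceiling)
    t_msg = npck * CYCLES_PER_PACKET             # cycles to transfer one message
    t_int = t_start + qsize * t_loop             # assumed cycles for a full-batch ISR
    msize_bits = 32 * msize_words
    if t_int > qsize * t_msg:
        # Short messages: the ISR limits the rate, formulas (7)/(9).
        bits_per_cycle = msize_bits * qsize / t_int
    else:
        # Long messages: the transfer time limits the rate, formulas (8)/(10).
        bits_per_cycle = msize_bits / t_msg
    return bits_per_cycle * F_CLK_HZ

# Q4(450): queue of four requests, 450-cycle loop, 20-cycle ISR start time.
for words in (64, 256, 512):
    print(words, max_throughput_bps(words, 4, 20, 450) / 1e9)
```

Sweeping the message size with the Q4(450) parameters shows the same qualitative shape as Fig. 2: ISR-limited growth for short messages and saturation at 1.28 GBits/s for long ones. The exact crossover point depends on the real ISR timing, so this sketch does not claim to reproduce the 199-word threshold.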
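The queue mechanism that these formulas model, in which one Tx-ISR execution refills the HW send-request queue from the SW send-request queue while the per-message interrupts stay masked, can be sketched as a toy Python model (the queue names, the depth, and the refill policy are assumptions for illustration, not the authors' implementation):

```python
from collections import deque

QSIZE = 4                    # depth of the HW send-request queue (batch size)

sw_queue = deque(range(16))  # SW send-request queue filled by the application
hw_queue = deque()           # HW send-request queue drained by the Tx-master
isr_runs = 0                 # executed Tx-ISRs, i.e. serviced interrupt batches
sent = 0                     # messages sent by the MSIQ Tx-master

while sw_queue or hw_queue:
    if not hw_queue:
        # One Tx-ISR execution (the Tx-loop) moves up to QSIZE send-requests
        # from the SW queue to the HW queue; the per-message interrupts
        # raised meanwhile stay masked because the processor is in the ISR.
        isr_runs += 1
        while sw_queue and len(hw_queue) < QSIZE:
            hw_queue.append(sw_queue.popleft())
    hw_queue.popleft()       # the Tx-master sends one message
    sent += 1

# 16 messages cost only ceil(16 / QSIZE) = 4 ISR executions instead of 16.
print(sent, isr_runs)
```

With QSIZE = 1 the model degenerates to one ISR execution per message, which corresponds to the queue-less basic MSI represented by lines Q1(300) and Q1(600) in Fig. 2.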
    Furthermore, the results in Fig. 2 do not show the performance with message bursts. Because traffic usually also contains bursts of messages, it is necessary that the NI is able to achieve a high peak performance for short time intervals under burst traffic. This can be achieved with the HW send-request and HW receive-request queues. For example, with queues of eight requests the MSIQ Masters are able to send and receive bursts of eight messages at the maximum rate without stopping their operation.

Figure 2. Theoretic maximum throughput of the MSI and the MSIQ.

    The synthesis results are in Table II. The MSIQs and the MSI contain Tx-FIFOs and Rx-FIFOs of four packets. The logic and register consumptions of the MSIQs and the MSI are quite similar, but the amount of block memory bits grows clearly as the size of the queues is increased. The maximum size of the queues is 16 requests. With queues of that size the MSIQ would consume 4096 block memory bits, but it would also provide better theoretic maximum throughput and burst tolerance. Additionally, it would be possible to use a smaller HW send-request queue so as to reduce the HW costs, because the SW send-request queue can store a large number of send-requests in any case. For example, with a HW send-request queue of four requests and a HW receive-request queue of 16 requests the MSIQ would consume 2560 block memory bits.

TABLE II. RESOURCE CONSUMPTIONS IN STRATIX III EP3SL150 [15]

FPGA resource       | MSI (Qsize = 1) | MSIQ (Qsize = 4) | MSIQ (Qsize = 8)
Combinational ALUTs | 1550            | 1665 (7.4%)      | 1695 (9.3%)
Memory ALUTs        | 0               | 0 (0.0%)         | 0 (0.0%)
Logic registers     | 1454            | 1609 (10.6%)     | 1609 (10.6%)
Block memory bits   | 1024            | 1792 (75.0%)     | 2560 (150.0%)

IV. CONCLUSIONS
    This paper presents the MSIQ NI, where a new queue mechanism is used for batching interrupts in order to improve the performance. Interrupts generated by the NIs produce a lot of SW overhead, and the performance can be improved by reducing the interrupt frequency. This is achieved by the send-request and receive-request queues, which make it possible to batch interrupt service requests so that individual ISR executions can serve multiple interrupt requests. The throughput improves especially with longer messages. Furthermore, the burst tolerance with short messages improves. In addition to the interrupt batching, this is also partly owing to the fact that the request queues allow the MSIQ HW to continue sending and receiving messages while the processor is running the ISR. Hence, the new queue mechanism enables more efficient concurrent operation of the MSIQ HW and SW. The results of the performance analysis and the logic synthesis also show clearly that the performance can be improved with tolerable costs. It would also be possible to reduce the HW costs by using smaller send-request queues in the MSIQ without reducing the performance significantly.

ACKNOWLEDGMENT
    This research is funded by the Academy of Finland under grant 122361.

REFERENCES
[1] Z.D. Dittia, G.M. Parulkar, and J.R. Cox, "The APIC Approach to High Performance Interface Design: Protected DMA and Other Techniques," Proc. of the IEEE International Conference on Computer Communications, Kobe, Japan, Apr. 7-12, 1997, pp. 823-831.
[2] A.F. Diaz, J. Ortega, A. Canas, F.J. Fernandez, M. Anguita, and A. Prieto, "The Lightweight Protocol CLIC on Gigabit Ethernet," Proc. of the International Parallel and Distributed Processing Symposium, Nice, France, Apr. 22-26, 2003, 8 pp.
[3] P. Gilfeather and A.B. Maccabe, "Modeling Protocol Offload for Message-Oriented Communication," Proc. of the IEEE International Conference on Cluster Computing, Burlington, Massachusetts, USA, Sept. 27-30, 2005, pp. 1-10.
[4] S.A. AlQahtani, "Performance Evaluation of Handling Interrupts Schemes in Gigabit Networks," Proc. of the IEEE International Conference on Computer and Information Technology, Aizu-Wakamatsu, Fukushima, Japan, Oct. 16-19, 2007, pp. 497-502.
[5] B. Goglin and N. Furmento, "Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet," Proc. of the IEEE International Conference on Cluster Computing, New Orleans, Louisiana, USA, Aug. 31-Sept. 4, 2009, pp. 1-9.
[6] J. Mogul and K.K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-Driven Kernel," ACM Transactions on Computer Systems, Vol. 15, No. 3, Aug. 1997, pp. 217-252.
[7] K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal, "Integrating Polling, Interrupts, and Thread Management," Proc. of the Frontiers of Massively Parallel Computing Symposium, Annapolis, MD, USA, Oct. 27-31, 1996, pp. 13-22.
[8] H. Kariniemi and J. Nurmi, "Micronmesh for Fault-tolerant GALS Multiprocessors on FPGA," Proc. of the International Symposium on System-on-Chip, Tampere, Finland, Nov. 4-6, 2008, pp. 1-8.
[9] H. Kariniemi and J. Nurmi, "Fault-Tolerant Communication over Micronmesh NoC with Micron Message-Passing Protocol," Proc. of the 11th International Symposium on System-on-Chip, Tampere, Finland, Oct. 5-7, 2009, pp. 5-12.
[10] Altera Corp., NIOS II Software Developer's Handbook, March 2009. Website, <http://www.pldworld.com/_Semiconductors/Altera/one_click_niosII_docs_9_0/files/n2sw_nii5v2.pdf> 20.08.2010.
[11] J. Labrosse, MicroC/OS-II: The Real-Time Kernel, Second ed., CMP Books, San Francisco, USA, 2002.
[12] H. Kariniemi and J. Nurmi, "NoC Interface for Fault-Tolerant Message-Passing Communication on Multiprocessor SoC Platform," Proc. of the NORCHIP, Trondheim, Norway, Nov. 2009.
[13] Altera Corp., NIOS II Processor Reference Handbook, November 2009. Website, <http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf> 20.08.2010.
[14] Altera Corp., Quartus II Handbook v10.0, Ch. 2: System Interconnect Fabric for Memory-Mapped Interfaces, July 2010. Website, <http://www.altera.com/literature/hb/qts/qts_qii54003.pdf> 20.08.2010.
[15] Altera Corp., Stratix III Device Handbook, Volume I, San Jose, USA, July 2010. Website, <http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf> 20.08.2010.