           PIPELINE STALLING IN VHDL

                      KOSURU SAI MALLESWAR
 1. Objectives
 2. Pipelining – definition
 3. Modules of the project
 4. Coding technique
 5. VHDL files
  i. Register.vhd
  ii. Multiplexer.vhd
  iii. Pipelinestalling.vhd
  iv. Alu.vhd
  v. Pipelinedmultiplier.vhd
 6. User constraints file for FPGA
 7. Simulation waveforms – Performance Analysis
 8. Applications
 9. Conclusions
1. Objectives

    1. To design the “pipeline stalling system” used in computers and other digital
       electronic devices to increase their instruction throughput, i.e., to program a
       series of registers that move data from one stage to the next on a common
       clock.
    2. To program an ALU that fetches the opcode and operands in a pipelined sequence
       and executes the operations.
    3. To program a pipelined multiplier for the ALU that multiplies 32-bit numbers
       using the “partial multiply, shift and add” algorithm.

2. Pipelining - definition

    Pipelining is an implementation technique in which the execution of multiple
    instructions is overlapped. The computer pipeline is divided into stages, and each
    stage completes one part of an instruction in parallel with the others. The stages are
    connected one to the next to form a pipe: instructions enter at one end, progress
    through the stages, and exit at the other end. This allows the computer's control
    circuitry to issue instructions at the processing rate of the slowest stage, which is
    much faster than waiting for all the steps of one instruction to finish before
    starting the next. The term pipeline reflects the fact that every stage carries data
    at the same time and each stage is connected to the next.

    The transfer of data from one stage to the next is scheduled with the help of a
    “clock”. Most modern CPUs are driven by a clock. Internally, the CPU consists of
    logic and registers (flip-flops). When the clock signal arrives, the flip-flops take
    their new values, and the logic then needs a period of time to decode them. Then the
    next clock pulse arrives, the flip-flops again take their new values, and so on.

    Pipelining does not decrease the time needed to execute an individual instruction.
    Instead, it increases instruction throughput. The throughput of the instruction
    pipeline is determined by how often an instruction exits the pipeline. Because the
    pipe stages are hooked together, all the stages must be ready to proceed at the same
    time. The performance of a pipelined processor may vary widely between different
    programs.
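
    A rough way to quantify the throughput gain, under idealized assumptions that are
    not part of the project itself: let k be the number of stages, n the number of
    instructions and τ the clock period, with every stage taking one period and no
    stalls occurring. Then

```latex
T_{\text{seq}} = n \, k \, \tau, \qquad
T_{\text{pipe}} = (k + n - 1) \, \tau, \qquad
S = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n \, k}{k + n - 1}
```

    For a three-stage pipeline (k = 3) and n = 100 instructions, S = 300/102 ≈ 2.94.
    Each instruction still needs 3τ from entry to exit, and S approaches k only as n
    grows. Every stall cycle adds one τ to the pipelined time.
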
3. Modules of the project
1. Stalling pipeline architecture with registers accepting data on rising edges:

 Logic diagram:

             Stalling pipeline architecture with registers accepting data on rising edges

2. ALU that can fetch the opcode and operands in a pipelined sequence
3. Pipelined multiplier that can multiply two 32-bit numbers by using the “partial
   multiply, shift and add” algorithm
4. Coding technique

When storage elements accept data on a rising clock edge, initialize the clock to 0 so
that a transition does not occur at time zero. The three registers R1, R2 and R3 sit in
the three stages of the processor, namely the fetch unit, the decode unit and the
execute unit. The registers point to the location from which program code is being read.
The stall clock is the logical OR of the clock and the stall signal.

On the first rising edge of the stall clock, the data in R1 is sent to R2 and the data in
R2 is sent to R3. On the next rising edge, R1 increments and points to the next location,
and the data in R2 moves to R3. While stall is low, R1 updates R2 and R2 updates R3 on
each rising edge of the clock.

When stall goes high, R1 transfers its data to R2 and R1 is updated from memory on the
rising edge of the clock, but R3 does not receive an instruction; it receives zeros from
the multiplexer instead. This is useful when executing instructions that involve a
forward jump.

The ALU is programmed by making use of the pipelined increment of the pointed-to memory
locations. The code is stored in memory such that the contents of the first location
specify the operation to be performed, and the following locations contain the operands.
The ALU fetches the contents of 3 memory locations at a time. The arithmetic or logical
operation is selected by the 16 most significant bits of the instruction, which is held
in the ir register.

The operands are held temporarily in the ALU registers ar and br while calculations are
performed. The output of the ALU is available on alu_out.

In the pipelined multiplier, the inputs a and b are two unsigned 32-bit numbers. On each
rising edge, a and b are multiplied and the output y is updated. Starting from the
right-hand end, a is multiplied by the 8 least significant bits of b; then b is shifted
right by 8 bits and a is again multiplied by the 8 least significant bits, and so on
until the multiplication is complete. The 4 partial products, each shifted to its byte
position, are added to produce the output.
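
The byte-wise scheme can be written compactly. Here b_i is shorthand, introduced only
for this note, for the i-th 8-bit slice of b counted from the least significant end:

```latex
a \cdot b = \sum_{i=0}^{3} \left( a \cdot b_i \right) 2^{8i},
\qquad b_i = \left\lfloor \frac{b}{2^{8i}} \right\rfloor \bmod 2^{8}
```

For example, with a = 3 and b = 258 (hexadecimal 0102), b_0 = 2 and b_1 = 1, so
a·b = 3·2 + (3·1)·2^8 = 6 + 768 = 774.
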
5. VHDL files
   1. Register.vhd:

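  library IEEE;
  use IEEE.std_logic_1164.all;

  entity regist is
    port(clk   : in  std_logic;
         clear : in  std_logic;
         ip    : in  std_logic_vector (31 downto 0);
         op    : out std_logic_vector (31 downto 0) );
  end entity regist;

  architecture beh of regist is
    signal temp : std_logic_vector(31 downto 0);
  begin
    -- 32-bit register with asynchronous clear
    reg: process(clk, clear)
    begin
      if clear='1' then
        temp <= (others=>'0');
      --elsif rising_edge(clk) then
      -- evaluated only on clk/clear events, so ip is captured when clk goes
      -- high (or when clear is released while clk is high)
      elsif clk='1' then
        temp <= ip;
      end if;
    end process reg;

    op <= temp;  -- drive the output port from the stored value
  end architecture beh;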
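
  A minimal testbench sketch for the register, offered as an illustration rather than as
  part of the original project. The names regist_tb, dut and stimulus are hypothetical.
  It assumes that regist compiles into library work with the ports shown above and that
  op reflects the stored value.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

entity regist_tb is
end entity regist_tb;

architecture sim of regist_tb is
  signal clk   : std_logic := '0';  -- starts low so no edge occurs at time zero
  signal clear : std_logic := '1';  -- asserted at start, released by the stimulus
  signal ip    : std_logic_vector(31 downto 0) := (others => '0');
  signal op    : std_logic_vector(31 downto 0);
begin
  dut: entity work.regist
    port map(clk => clk, clear => clear, ip => ip, op => op);

  stimulus: process
  begin
    wait for 2 ns;
    assert op = x"00000000"
      report "clear should force op to zero" severity error;

    clear <= '0';
    ip    <= x"0000000A";
    wait for 3 ns;
    clk <= '1';                     -- first rising edge
    wait for 1 ns;
    assert op = x"0000000A"
      report "op should take ip on the rising edge" severity error;

    ip  <= x"0000000B";             -- new input while the clock returns low
    clk <= '0';
    wait for 4 ns;
    assert op = x"0000000A"
      report "op should hold its value until the next rising edge" severity error;

    clk <= '1';                     -- second rising edge
    wait for 1 ns;
    assert op = x"0000000B"
      report "op should take the new ip value" severity error;

    clear <= '1';                   -- asynchronous clear
    wait for 1 ns;
    assert op = x"00000000"
      report "clear should reset op without a clock edge" severity error;

    report "regist_tb finished" severity note;
    wait;                           -- stop the stimulus process
  end process stimulus;
end architecture sim;
```

  The clock is driven by hand instead of by a free-running process, so each check sits
  at a known point relative to an edge and the simulation ends by itself.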

 2. Multiplexer.vhd:

    library IEEE;
    use IEEE.std_logic_1164.all;

    entity mux is
      port(in0    : in  std_logic_vector (31 downto 0);
           in1    : in  std_logic_vector (31 downto 0);
           ctl    : in  std_logic;
           result : out std_logic_vector (31 downto 0));
    end entity mux;

    architecture beh of mux is
    begin
      -- 2-to-1 multiplexer: in1 when ctl is '1', otherwise in0
      result <= in1 when ctl='1'
                else in0 after 1 ns;
    end architecture beh;

3. Pipelinestalling.vhd:

   library IEEE;
   use IEEE.std_logic_1164.all;
   use IEEE.std_logic_textio.all;
   use IEEE.std_logic_arith.all;

   entity pipe is
     port(reg1,reg2,reg3 : inout std_logic_vector(31 downto 0));
   end entity pipe;

   architecture beh of pipe is
     signal clk   : std_logic := '0';  -- master clock
     signal stall : std_logic := '0';  -- stall signal
     signal sclk  : std_logic := '0';  -- stall clock
     signal clear : std_logic := '1';  -- one shot clear

     subtype word is std_logic_vector(31 downto 0);
     signal zeros  : word := (others=>'0');
     signal R1     : word;
     signal R1_a   : word;
     signal R2     : word;
     signal R2_mux : word;
     signal R3     : word;
     signal cnt    : word := "00000000000000000000000000000000";
   begin
     -- free-running clock (10 ns period) and one-shot release of clear
     clock: process(clk)
     begin
       if clear='1' then
         clear <= '0' after 500 ps;
       end if;
       clk <= not clk after 5 ns;
     end process clock;

     -- counter feeding R1, incremented on each rising edge of the stall clock
     cnt <= unsigned(cnt)+unsigned'("00000000000000000000000000000001") after 1 ns
            when sclk'event and sclk='1';

     stall <= '1' after 1 ns when R2="00000000000000000000000000000010" and
              else '0' after 1 ns;

     -- stall clock: held high while stall is asserted
     sclk <= clk or stall after 1 ns;

     -- pipeline stages
     R1_reg: entity work.regist port map(sclk, clear, cnt, R1);

     R1_a <= R1 or "00000000000000000000000000000000" after 1 ns ; --logic

     R2_reg: entity work.regist port map(sclk, clear, R1_a, R2);

     -- R3 receives zeros instead of R2 while stall is high
     A2_mux: entity work.mux port map(R2, zeros, stall, R2_mux);

     R3_reg: entity work.regist port map(clk, clear, R2_mux, R3);

   end beh;

    4. ALU.vhd:

library IEEE;

use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;
use ieee.numeric_std.all;

entity alu is
  port (clk        : in    std_logic;
        ir, ar, br : inout std_logic_vector(31 downto 0);
        alu_sig    : out   std_logic;
        alu_out    : out   std_logic_vector(63 downto 0));
end alu;

architecture beh of alu is
  signal alu_st     : std_logic;
  signal alu_output : std_logic_vector(63 downto 0);
  type mem is array (0 to 31) of std_logic_vector(31 downto 0);
  signal temp_mem : mem;

  constant content:mem :=

  component pipe is
    port(reg1,reg2,reg3 : inout std_logic_vector(31 downto 0));
  end component;

begin
  reg: pipe port map(reg1=>br,reg2=>ar,reg3=>ir);

  clocked_alu: process(clk,ir)
    -- operands addressed by ar and br; the low 5 bits index the 32-word memory
    -- (conv_integer rejects a full 32-bit unsigned argument)
    variable x, y : std_logic_vector(31 downto 0);
    constant zero32 : std_logic_vector(31 downto 0) := (others => '0');
  begin
    if (rising_edge(clk)) then
      x := temp_mem(conv_integer(ar(4 downto 0)));
      y := temp_mem(conv_integer(br(4 downto 0)));
      alu_output <= (others =>'0');
      alu_st <= '1';

      -- the 16 most significant bits of ir select the operation;
      -- 32-bit results are zero-extended to the 64-bit output
      case ir(31 downto 16) is
        when "0000000000000000" => null;
        when "0000000000000001" => alu_output <= x * y;
        when "0000000000000010" => alu_output <= zero32 & (x - y);
        when "0000000000000011" => alu_output <= zero32 & (y - x);
        when "0000000000000100" => alu_output <= zero32 & (x and y);
        when "0000000000000101" => alu_output <= zero32 & (x or y);
        when "0000000000000110" => alu_output <= zero32 & (x xor y);
        when "0000000000000111" => alu_output <= zero32 & (x nand y);
        when "0000000000001000" => alu_output <= zero32 & (x nor y);
        when "0000000000001001" => alu_output <= zero32 & (not x);
        when others => null;
      end case;
    end if;
    alu_sig <= alu_st;
    alu_out <= alu_output;
  end process clocked_alu;
end beh;
5. Pipelinedmultiplier.vhd:

      library ieee;
      use ieee.std_logic_1164.all;
      use ieee.numeric_std.all;

      entity pipemult is
        port (
          clk1 : in  std_logic ;
          a, b : in  unsigned(31 downto 0) ;
          y    : out unsigned(63 downto 0)
        ) ;
      end pipemult ;

      architecture rtl3 of pipemult is
        signal y1, y2, y3, y4, y5 : unsigned (39 downto 0) ;
        constant z : unsigned (63 downto 0) := (others => '0');
      begin
        process (clk1)
        begin
          if (rising_edge(clk1)) then
            -- stage 1: four 32x8-bit partial products, one per byte of b
            y1 <= a * b( 7 downto 0) ;
            y2 <= a * b(15 downto 8) ;
            y3 <= a * b(23 downto 16) ;
            y4 <= a * b(31 downto 24) ;
            -- stage 2: add the partial products registered on the previous
            -- edge, each shifted to its byte position
            y <= (z(63 downto 40) & y1 ) +
                 (z(63 downto 48) & y2 & z( 7 downto 0)) +
                 (z(63 downto 56) & y3 & z(15 downto 0)) +
                 (y4 & z( 23 downto 0)) ;
          end if;
        end process;
      end rtl3 ;
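
      A minimal testbench sketch for the multiplier, given as an illustration and not
      taken from the original project. The names pipemult_tb, dut, done, clock and
      stimulus are hypothetical. It assumes that pipemult compiles into library work
      and that its unsigned ports come from ieee.numeric_std. Because the partial
      products and the sum are registered on successive edges, each check waits for
      two rising edges after the inputs change.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipemult_tb is
end entity pipemult_tb;

architecture sim of pipemult_tb is
  signal clk1 : std_logic := '0';
  signal a, b : unsigned(31 downto 0) := (others => '0');
  signal y    : unsigned(63 downto 0);
  signal done : boolean := false;   -- stops the clock when the checks are over
begin
  dut: entity work.pipemult
    port map(clk1 => clk1, a => a, b => b, y => y);

  -- 10 ns clock, rising edges at 5 ns, 15 ns, 25 ns, ...
  clock: process
  begin
    while not done loop
      clk1 <= '0'; wait for 5 ns;
      clk1 <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process clock;

  stimulus: process
  begin
    -- 3 * 258 = 774; only the two low bytes of b are non-zero
    a <= to_unsigned(3, 32);
    b <= x"00000102";
    wait until rising_edge(clk1);   -- partial products y1..y4 registered
    wait until rising_edge(clk1);   -- shifted sum registered into y
    wait for 1 ns;
    assert y = to_unsigned(774, 64)
      report "3 * 258 should give 774" severity error;

    -- largest operands: (2**32 - 1)**2 = 2**64 - 2**33 + 1
    a <= x"FFFFFFFF";
    b <= x"FFFFFFFF";
    wait until rising_edge(clk1);
    wait until rising_edge(clk1);
    wait for 1 ns;
    assert y = unsigned'(x"FFFFFFFE00000001")
      report "full-width product is wrong" severity error;

    report "pipemult_tb finished" severity note;
    done <= true;
    wait;
  end process stimulus;
end architecture sim;
```

      The second case drives every byte of b to its maximum, so all four partial
      products and their carries contribute to the result.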
6. User constraints file for FPGA

  NET "clk" LOC = "AJ15";
  NET "clear" LOC = "AC11"; #SW0

  NET "R3(31)" LOC = "T7";
  NET "R3(30)" LOC = "T8";
  NET "R3(29)" LOC = "U4";
  NET "R3(28)" LOC = "U5";
  NET "R3(27)" LOC = "V2";
  NET "R3(26)" LOC = "W2";
  NET "R3(25)" LOC = "T9";
  NET "R3(24)" LOC = "U9";
  NET "R3(23)" LOC = "V3";
  NET "R3(22)" LOC = "V4";
  NET "R3(21)" LOC = "W1";
  NET "R3(20)" LOC = "Y1";
  NET "R3(19)" LOC = "U7";
  NET "R3(18)" LOC = "U8";
  NET "R3(17)" LOC = "V5";
  NET "R3(16)" LOC = "V6";
  NET "R3(15)" LOC = "W3";
  NET "R3(14)" LOC = "W4";
  NET "R3(13)" LOC = "AA1";
  NET "R3(12)" LOC = "AB1";
  NET "R3(11)" LOC = "W5";
  NET "R3(10)" LOC = "W6";
  NET "R3(9)" LOC = "Y4";
  NET "R3(8)" LOC = "Y5";
  NET "R3(7)" LOC = "AA3";
  NET "R3(6)" LOC = "AA4";
  NET "R3(5)" LOC = "W7";
  NET "R3(4)" LOC = "W8";
  NET "R3(3)" LOC = "AB3";
  NET "R3(2)" LOC = "AB4";
  NET "R3(1)" LOC = "AB2";
  NET "R3(0)" LOC = "AC2";
7. Simulation waveforms – Performance Analysis
Output waveforms of the pipeline simulation in ModelSim:
   Pipeline simulation:

   Pipelined multiplier simulation
8. Applications
Pipelining for multicore computers:

Using a pipeline architecture is a common and effective method of increasing
throughput and reducing loop execution times on multicore computers. Pipelining
can be used when data must go through multiple processes that can be broken into
stages. Pipelining is a type of task parallelism that can be implemented for a series
of serial tasks that have data dependencies.

Operating systems design:

In Unix-like computer operating systems (and, to some extent, Windows), a
pipeline is the original software pipeline: a set of processes chained by their
standard streams, so that the output of each process (stdout) feeds directly as input
(stdin) to the next one. Each connection is implemented by an anonymous pipe.
Filter programs are often used in this configuration.

Superscalar pipelining:

Superscalar pipelining involves multiple pipelines in parallel. Internal components
of the processor are replicated so it can launch multiple instructions in some or all
of its pipeline stages. The RISC System/6000 has a forked pipeline with different
paths for floating-point and integer instructions. If there is a mixture of both types
in a program, the processor can keep both forks running simultaneously. Both
types of instructions share two initial stages (Instruction Fetch and Instruction
Dispatch) before they fork. Often, however, superscalar pipelining refers to
multiple copies of all pipeline stages (in terms of laundry, this would mean four
washers, four dryers, and four people who fold clothes). Many of today's machines
attempt to find two to six instructions that they can execute in every pipeline stage.
If some of the instructions are dependent, however, only the first instruction or
instructions are issued.

Pipelining to firmware:

Pipelining at the firmware level of machine organization can provide significant
execution time benefits for certain types of instructions. The essential concept
involved with this approach is the pipelining of operations within the hardware
under direct control of the firmware, rather than the pipelining of

Dynamic pipelining:

Dynamic pipelines have the capability to schedule around stalls. A dynamic
pipeline is divided into three units: the instruction fetch and decode unit, five to ten
execute or functional units, and a commit unit. Each execute unit has reservation
stations, which act as buffers and hold the operands and operations.
9. Conclusions
To summarize, pipelining is a technique that programmers can use to gain a
performance increase in inherently serial applications (on multicore machines).
The CPU industry trend of increasing cores per chip means that strategies such as
pipelining will become essential to application development in the near future. In
order to gain the most performance increase possible from pipelining, individual
stages must be carefully balanced so that no single stage takes a much longer time
to complete than other stages.

The project covers a pipelined execution unit, a pipelined multiplier and a
pipelined ALU. Handling of structural, data and control hazards can also be
programmed to improve design efficiency, since it is very important for the physical
implementation of the design. Cache miss handling and exception handling are
also required to improve the performance of pipelining in RISC-like systems.
Pipeline stalling in vhdl

