This presentation about Adapteva's Parallella is one of the most comprehensive to date. Feel free to use it. I gave this talk on 10 Dec 2014 at the Cloud Research Lab, Ericsson AB, Lund, Sweden.
2. Outline
This presentation was held on: 10th Dec 2014
Place: Ericsson Research Lab, Lund, Sweden
This work is licensed under a Creative Commons Attribution 4.0 International License.
4. Genesis
Influenced by open-source hardware design projects:
Arduino
Beaglebone
Inspired by:
Raspberry Pi
Zedboard
The board is open source hardware*
*https://github.com/parallella/parallella-hw
5. In the News: “Smallest Supercomputer in the World”
Adapteva A-1
• Launched at ISC'14*
• Has 2,112 RISC cores
• Based on the 64-core Epiphany board
• Power consumption: 200 Watts
• Performance: 16 GFLOP/s per Watt
*http://primeurmagazine.com/weekly/AE-PR-07-14-104.html
Image source: https://twitter.com/StreamComputing/media
6. Adapteva (Zynq + Epiphany III)
• Based on the Epiphany™ architecture (multi-core MIMD architecture)
• Fully programmable Xilinx Zynq SoC with a dual-core ARM Cortex-A9 CPU
• 16/64-core microprocessor/coprocessor:
No cache
32-bit cores
Max clock speed 1 GHz (600 MHz on the Parallella board)
Peak performance: 32 GFLOPS
Supports fused multiply–add (FMA) operations
Superscalar floating-point (IEEE-754) RISC CPU cores
Two floating-point operations per clock cycle
• Supports static dual-issue scheduling
7. Adapteva (Zynq + Epiphany III)
IALU: one 32-bit integer operation per clock cycle
FPU: one floating-point instruction per clock cycle
64 general-purpose registers
The program sequencer supports all standard program flows
Branching costs 3 cycles
No hardware support for:
Integer multiply
Floating-point divide
Double-precision floating-point ops
eCore CPU(1)
8. Epiphany Architecture(1)
Every router in the mesh is connected to the North, East, West, and South routers and to a mesh node.
The router at each node contains round-robin arbiters.
Routing hop latency is 1.5 clock cycles.
9. Interconnects
• eCores are connected by a low-latency 2D NoC (eMesh):
rMesh for reads
xMesh for off-chip writes
cMesh for on-chip writes
• The eMesh has only nearest-neighbor direct connections.
• Each routing link can transfer up to 8 bytes of data on every clock cycle.
Network-on-Chip Overview(1)
10. Interconnects
Network Topology(1)
• The network completes transactions in a single clock cycle because of spatial locality and short point-to-point on-chip wires.
• Each mesh node has a globally addressable ID (6-bit row-ID and 6-bit column-ID).
11. Memory
• Shared memory (32-bit-wide flat memory, unprotected)
• Primary memory: 1 GB (DDR3 SDRAM)
• Flash memory: 128 Mb (boot code)
• Little-endian memory architecture
• A single, flat address space of 2^32 8-bit bytes (i.e., 2^30 32-bit words)
• SRAM distribution:
Chip Core   Start Address   End Address   Size
(0,0)       00000000        00007FFF      32KB
12. Memory
• On every clock cycle, 64 bits of data/instructions can be exchanged between memory and the CPU's register file, network interface, or local DMA.
• Dual-channel DMA engine
• Memory-mapped registers
• Each eCore has 32 KB of local memory (4 sub-banks × 8 KB)
• The eCPU has a variable-length instruction pipeline that depends on the type of instruction being executed.
14. Memory: Read–Write Transactions
• Read transactions are non-blocking.
• RW transactions to local memory follow a strong memory-order model.
• RW transactions that access non-local memory follow a weak memory-order model.
• Solution: use run-time synchronization calls with order-dependent memory sequences.
• Less inter-node communication
15. Scalability
• It has four identical source-synchronous bidirectional off-chip eLinks.
• The eLink is non-blocking.
• Optimal bandwidth is achieved when a large number of incrementally numbered 64-bit data packets are sent consecutively.
FPGA eLink Integration(1)
18. How to get started
1. Create a Parallella micro-SD card [1]
2. Connect the wires as described in [2]
3. Power on
4. Go...
[1] http://www.parallella.org/create-sdcard/
[2] http://www.parallella.org/quick-start/
19. Epiphany Host Library (eHAL)
• Encapsulates low-level Epiphany functionality (Epiphany device driver)
• The library interface is defined in "e-hal.h".
• Steps to write a program:
1. Prepare the system:
e_init(NULL);                    // initialize the system
e_reset_system();                // reset the platform
e_get_platform_info(&platform);  // get the actual system parameters
20. Epiphany Host Library (eHAL)
2. Allocate memory (optional):
e_mem_t emem;                              // object of type e_mem_t
char emsg[Size];
e_alloc(&emem, <BufOffset>, <BufferSize>); // allocate a buffer in shared external memory
3. Open a workgroup:
e_open(&dev, 0, 0, platform.rows, platform.cols); // open all cores
(OR)
e_open(&dev, 0, 0, 1, 1); // core coordinates are relative to the workgroup
e_reset_group(&dev);      // soft reset
21. Epiphany Host Library (eHAL)
4. Load the program:
e_load("program", &dev, 0, 0, E_TRUE);
5. Wait, then print the message from the buffer:
usleep(time);
e_read(&emem, 0, 0, 0x0, emsg, _BufSize);
fprintf(stderr, "\"%s\"\n", emsg);
6. Close every connection:
e_close(&dev);
e_free(&emem);
e_finalize();
22. Epiphany Hardware Utility Library (eLib)
• Provides functions for configuring and querying eCores.
• Also automates many common eCore programming tasks.
• Steps to write an eCore program:
Step 1: Declare shared memory:
char outbuf[128] SECTION("shared_dram");
Step 2: Query the eCore ID:
e_coreid_t coreid;
coreid = e_get_coreid();
Step 3: Print "Hello World" with the core ID
Step 4: Exit
25. Where to put the code
• 3 different Linker Description Files (LDF):
• Internal.ldf: store data/instructions in internal SRAM (limit 32 KB).
• Fast.ldf: user code/data and stack in internal SRAM; standard libraries in external DRAM. Good for a few large library functions.
• Legacy.ldf: everything stored in external DRAM (limit 1 MB). Slower than internal and fast.
26. Synchronization (eCores)
http://www.linuxplanet.org/blogs/?cat=2359
Barrier for synchronizing parallel executing threads:
1. Setup:
e_barrier_init(bar_array[], tgt_bar_array[])
2. Call the function
3. Wait for sync:
e_barrier(bar_array[], tgt_bar_array[])
Mutex (blocking & non-blocking):
1. Setup:
e_mutex_init(0, 0, s_mutex, mutex_attr)
2. Gain access:
e_mutex_lock(0, 0, s_mutex)
3. Call the function
4. Release access:
e_mutex_unlock(0, 0, s_mutex)
28. My Understanding
Synchronization between the ARM and the eCores uses a flag.
Because: eMesh writes from an individual Epiphany core to external shared DRAM update the DRAM in the same order as they were sent. However, if multiple cores are writing to external DRAM, the order in which the writes reach DRAM can change.
Solutions:
1. Set a flag
2. Use the software barrier function e_barrier() (time consuming)
3. Use the experimental hardware barrier opcode
29. Useful for Sync
eCore-side read & write:
e_write(remote, Dst, row, col, Src, Byte_size);
e_read(remote, Dst, row, col, Src, Byte_size);
The remote parameter must be either:
e_group_config if remote is a workgroup core, or
e_emem_config if remote is an external memory buffer
30. Conclusion
• Fast and power-efficient
• Power needed: 5V/2A (0.3A–1.5A)
• Fully featured ANSI-C/C++ and OpenCL programming environments
• Large application-domain support
• But...
• Needs an improved SDK (on the way)
• A cache might improve performance (a software cache is on the way)
• Synchronization and randomness are big issues