15. def testXLA_JIT(self):
      with tf.Session() as sess:
        x = tf.placeholder(tf.float32, [2], name="x")
        with tf.device("device:XLA_GPU:0"):
          y = x * 2
        result = sess.run(y, {x: [1.5, 0.5]})
To run on XLA_GPU!
28. """Configuration file for an XLA plugin.
- please don't check in changes to this file
- to prevent changes appearing in git status, use:
git update-index --assume-unchanged tensorflow/compiler/plugin/BUILD
To add additional devices to the XLA subsystem, add targets to the
dependency list in the 'plugin' target. For instance:
deps = ["//tensorflow/compiler/plugin/example:plugin_lib"],
"""
licenses(["notice"])
package(
default_visibility = ["//visibility:public"],
)
cc_library(
name = "plugin",
deps = [
"//tensorflow/compiler/plugin/dynamic:dynamic_plugin_lib "
],
)
BUILD
30. Scenario 3: Non-CPU-like hardware without an existing LLVM backend
If it is not possible to utilize LLVM, then the best option is to implement a new XLA backend for
the desired hardware. This option requires the most effort. The classes that need to be implemented
are as follows (a hedged sketch follows this slide):
・StreamExecutor: For many devices, not all methods of StreamExecutor are needed. See existing
  StreamExecutor implementations for details.
・xla::Compiler: This class encapsulates the compilation of an HLO computation into an
  xla::Executable.
・xla::Executable: This class is used to launch a compiled computation on the platform.
・xla::TransferManager: This class enables backends to provide platform-specific mechanisms for
  constructing XLA literal data from given device memory handles. In other words, it helps encapsulate
  the transfer of data from the host to the device and back.
Developing a new backend for XLA
https://www.tensorflow.org/performance/xla/developing_new_backend
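To make the list concrete, here is a hedged skeleton of the xla::Executable role, reusing the ExecuteOnStream signature quoted later in this deck (slide 60). MyExecutable is a hypothetical name, the constructor is simplified, and signatures vary between TensorFlow versions, so treat this as an illustration rather than the real interface:

// Hedged sketch: the Executable returned by Compiler::RunBackend wraps
// the compiled artifact and knows how to launch it on a stream.
class MyExecutable : public xla::Executable {
 public:
  explicit MyExecutable(std::unique_ptr<xla::HloModule> module)
      : xla::Executable(std::move(module)) {}  // base ctor simplified here

  xla::StatusOr<std::unique_ptr<xla::ShapedBuffer>> ExecuteOnStream(
      const xla::ServiceExecutableRunOptions* run_options,
      tensorflow::gtl::ArraySlice<const xla::ShapedBuffer*> arguments,
      xla::HloExecutionProfile* hlo_execution_profile) override {
    // Launch the compiled computation on run_options->stream() and
    // return the result as a ShapedBuffer (see slides 62-63 for the
    // example plugin's interpreter-based implementation).
    return xla::Unimplemented("MyExecutable::ExecuteOnStream");
  }
  // (Other pure-virtual methods, e.g. ExecuteAsyncOnStream, omitted.)
};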
33. Sample code
import tensorflow as tf
import numpy as np

x = tf.placeholder(tf.float32, shape=(2, 3))
y = tf.placeholder(tf.float32, shape=(3,))

# Device name
with tf.device("/job:localhost/replica:0/task:0/device:XLA_NGRAPH:0"):
    a = x + y

with tf.Session() as sess:
    res = sess.run(a, feed_dict={x: np.ones((2, 3)), y: np.ones((3,))})
    print("result:", res)
35. tensorflow/compiler/plugin/dynamic/plugin_adapter.cc
bool InitPluginModule();  // forward declaration
volatile bool module_initialized = InitPluginModule();

bool InitPluginModule() {
  // We are running as part of the TensorFlow python environment
  auto tf_root = xla::dynamic_plugin::GetTensorflowRoot();
  auto plugin_directory = tf_root + "/plugins/";
  std::string pattern = plugin_directory + "*.so";
  std::vector<std::string> files;
  auto result = tensorflow::Env::Default()->GetMatchingPaths(pattern, &files);
  return tensorflow::LoadDynamicPlugin(files[0]);
}

Loading the plugin library: ${tf_root}/plugins/*.so is loaded.
Note, however, that only the first matching library is loaded (a sketch that loads every match follows below).
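As a variation, here is a minimal sketch (my suggestion, not from the talk) that loads every matching .so rather than only files[0], using only the calls already shown above:

// Hedged sketch: iterate over all plugin libraries found under
// ${tf_root}/plugins/ and keep going even if one of them fails to load.
bool InitPluginModule() {
  auto tf_root = xla::dynamic_plugin::GetTensorflowRoot();
  std::string pattern = tf_root + "/plugins/*.so";
  std::vector<std::string> files;
  TF_CHECK_OK(tensorflow::Env::Default()->GetMatchingPaths(pattern, &files));
  bool any_loaded = false;
  for (const auto& lib_path : files) {
    // LoadDynamicPlugin returns bool (slide 36), so failures can be skipped.
    any_loaded |= tensorflow::LoadDynamicPlugin(lib_path);
  }
  return any_loaded;  // true if at least one plugin registered itself
}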
36. tensorflow/compiler/plugin/dynamic/plugin_adapter.cc
static bool LoadDynamicPlugin(std::string lib_path) {
  void* handle;
  auto result =
      tensorflow::Env::Default()->LoadLibrary(lib_path.c_str(), &handle);

  // Get the Plugin object
  xla::plugin::Info (*GetPluginData)();
  result = tensorflow::Env::Default()->GetSymbolFromLibrary(
      handle, "GetPluginData", (void**)(&GetPluginData));

LoadDynamicPlugin function (part 1): obtains a pointer to the GetPluginData function from the loaded library.
37. // Get the plugin info
  xla::plugin::Info plugin_info = GetPluginData();

  // Get the function pointers to the plugin methods
  auto Version = plugin_info.Version;
  auto DeviceInfo = plugin_info.DeviceInfo;
  auto RunBackend = plugin_info.RunBackend;
  auto GetTransferManager = plugin_info.GetTransferManager;
  auto Init = plugin_info.Init;
  auto SupportedDataTypes = plugin_info.SupportedDataTypes;

  auto device_info = DeviceInfo();

LoadDynamicPlugin function (part 2)
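For context, here is a hedged sketch of the plugin side that this code expects. The xla::plugin::Info field types are assumptions inferred purely from how plugin_info is used in these slides (the real header may differ); the device name echoes the example plugin's device name from the test slide (52), and the JIT/platform names are hypothetical:

// Hedged sketch of the symbol a plugin .so must export.
#include <string>

namespace xla {
namespace plugin {

struct DeviceInfoT {  // hypothetical name for the DeviceInfo() result type
  std::string XLA_DEVICE_NAME;
  std::string XLA_DEVICE_JIT_NAME;
  std::string PLATFORM_NAME;
  int visible_device_count;
  int device_priority;
};

struct Info {
  std::string (*Version)();
  DeviceInfoT (*DeviceInfo)();
  // ...plus RunBackend, GetTransferManager, Init and SupportedDataTypes
  // function pointers, whose signatures are not shown in the slides.
};

}  // namespace plugin
}  // namespace xla

// Exported with C linkage so GetSymbolFromLibrary(handle, "GetPluginData",
// ...) can find an unmangled symbol in the shared library.
extern "C" xla::plugin::Info GetPluginData() {
  xla::plugin::Info info;
  info.Version = []() { return std::string("0.0.1"); };
  info.DeviceInfo = []() {
    xla::plugin::DeviceInfoT d;
    d.XLA_DEVICE_NAME = "DYNAMIC_PLUGIN_EXAMPLE_DEVICE";   // see slide 52
    d.XLA_DEVICE_JIT_NAME = "XLA_DYNAMIC_PLUGIN_EXAMPLE";  // hypothetical
    d.PLATFORM_NAME = "DynamicPluginExample";              // hypothetical
    d.visible_device_count = 1;
    d.device_priority = 40;  // below the default of 50, see slide 47
    return d;
  };
  return info;
}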
38. // Create the platform id - unique for each plugin
  // TODO - create a unique value for platform id. Can't use
  // PLATFORM_DEFINE_ID() inside a function
  static int delta = 0;
  int temp;
  perftools::gputools::Platform::Id kPluginPlatformId = &temp + delta;
  delta++;

  // Kernel registrations
  auto supported_data_types = SupportedDataTypes();
  REGISTER_XLA_LAUNCH_KERNEL(device_info.XLA_DEVICE_NAME,
                             tensorflow::XlaLocalLaunchOp,
                             supported_data_types);
  REGISTER_XLA_DEVICE_KERNELS(device_info.XLA_DEVICE_NAME,
                              supported_data_types);
  REGISTER_XLA_BACKEND(device_info.XLA_DEVICE_JIT_NAME,
                       supported_data_types, OpFilter);

LoadDynamicPlugin function (part 3): kernel registration and backend registration.
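The &temp + delta trick above hands out addresses derived from a stack variable as opaque platform IDs. A hedged alternative sketch (my suggestion, not from the talk) that avoids pointer arithmetic past the end of an object:

// Hedged sketch: mint a unique Platform::Id per plugin from the address
// of a deliberately leaked heap object, which stays valid and distinct
// for the lifetime of the process.
static perftools::gputools::Platform::Id NewPluginPlatformId() {
  return static_cast<perftools::gputools::Platform::Id>(new int(0));
}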
39. // Platform registration
  std::unique_ptr<perftools::gputools::Platform> platform(
      new xla::dynamic_plugin::PlatformAdapter(
          device_info.PLATFORM_NAME, kPluginPlatformId,
          device_info.visible_device_count));
  perftools::gputools::MultiPlatformManager::RegisterPlatform(
      std::move(platform));

  // Call the Plugin Init
  auto status = plugin_info.Init(kPluginPlatformId);

  // Register the Compiler factory
  xla::Compiler::RegisterCompilerFactory(kPluginPlatformId, [=]() {
    return xla::MakeUnique<xla::dynamic_plugin::CompilerAdapter>(
        kPluginPlatformId, plugin_info);
  });

LoadDynamicPlugin function (part 4): platform registration and compiler registration.
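The CompilerAdapter itself is not shown in the talk. A hedged sketch of how it could delegate to the plugin's exported function pointers (all signatures here are assumptions for the TF 1.x era):

// Hedged sketch: CompilerAdapter forwards XLA's compile request to the
// RunBackend function pointer exported by the plugin .so.
class CompilerAdapter : public xla::Compiler {
 public:
  CompilerAdapter(perftools::gputools::Platform::Id platform_id,
                  const xla::plugin::Info& info)
      : platform_id_(platform_id), info_(info) {}

  perftools::gputools::Platform::Id PlatformId() const override {
    return platform_id_;
  }

  xla::StatusOr<std::unique_ptr<xla::Executable>> RunBackend(
      std::unique_ptr<xla::HloModule> module,
      perftools::gputools::StreamExecutor* executor,
      xla::DeviceMemoryAllocator* allocator) override {
    // Hand the HLO module to the plugin; the plugin returns an
    // xla::Executable implemented on its side of the .so boundary.
    return info_.RunBackend(std::move(module), executor, allocator);
  }

 private:
  perftools::gputools::Platform::Id platform_id_;
  xla::plugin::Info info_;
};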
45. // Transfer manager registration
  // Note: Ideally - we want to create the TransferManager with an implementation
  // but currently the creation is handled by the Registration method - which
  // doesn't allow passing parameters to the constructor.
  // This is inconsistent with the Compiler factory!
  // Register with the factory
  xla::dynamic_plugin::TransferManagerAdapter::Init(kPluginPlatformId);

  xla::dynamic_plugin::TransferManagerAdapter* new_transfer_manager{nullptr};
  const perftools::gputools::Platform* this_platform;
  auto statusor = perftools::gputools::MultiPlatformManager::PlatformWithId(
      kPluginPlatformId);
  if (statusor.ok()) {
    this_platform = statusor.ValueOrDie();
  }

LoadDynamicPlugin function (part 5)
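The limitation the comment complains about: XLA's transfer-manager registry takes a zero-argument creation function, so nothing like plugin_info can be passed to the constructor. A hedged sketch of that pattern (the exact registry API may differ by TensorFlow version):

// Hedged sketch: the creation callback takes no arguments, which is why
// the adapter cannot receive plugin parameters the way CompilerAdapter
// does through its capturing factory lambda.
static std::unique_ptr<xla::TransferManager> CreateTransferManagerAdapter() {
  return xla::MakeUnique<xla::dynamic_plugin::TransferManagerAdapter>();
}

static void RegisterTransferManagerAdapter(
    perftools::gputools::Platform::Id platform_id) {
  xla::TransferManager::RegisterTransferManager(
      platform_id, &CreateTransferManagerAdapter);
}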
47. // Register the Device - at the very last. That way - if we failed with other
  // steps above, the device won't be available and users will get an error at
  // the Python script stage
  // Set priority to be below the default priority (50),
  // so that Executor is not selected as a high priority device over other
  // default devices. See constructor comments for Registrar in
  // tensorflow/core/common_runtime/device_factory.h for a list of priorities
  // for devices.
  DeviceFactory::Register(
      device_info.XLA_DEVICE_NAME,
      new DeviceFactoryAdapter(device_info.PLATFORM_NAME,
                               device_info.XLA_DEVICE_NAME,
                               device_info.XLA_DEVICE_JIT_NAME),
      device_info.device_priority);
  return true;
}

LoadDynamicPlugin function (part 7): device registration.
52. Test code
tensorflow/compiler/plugin/dynamic/example/plugin_test.py

# Create the model
x = tf.placeholder(tf.float32, [None, 784])
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])

with tf.device('/device:DYNAMIC_PLUGIN_EXAMPLE_DEVICE:0'):
    y = tf.matmul(x, w) + b

sess = tf.Session(config=config)
tf.global_variables_initializer().run(session=sess)
60. Executable::ExecuteOnStream
tensorflow/compiler/xla/service/executable.h

  // Enqueues the compilation result on the provided stream,
  // passing the given arguments.
  // This call is blocking and returns after the execution is done.
  //
  // If the hlo_execution_profile is provided as non-nullptr, profiling will be
  // enabled.
  //
  // Returns a shaped buffer containing the result of the computation.
  virtual StatusOr<std::unique_ptr<ShapedBuffer>> ExecuteOnStream(
      const ServiceExecutableRunOptions* run_options,
      tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
      HloExecutionProfile* hlo_execution_profile) = 0;
62. Plugin code: ExecuteOnStream
tensorflow/compiler/plugin/dynamic/example/executable.cc

  // Transform the ShapedBuffer arguments into literals which the
  // evaluator consumes.
  std::vector<std::unique_ptr<xla::Literal>> arg_literals;
  for (tensorflow::int64 p = 0; p < computation->num_parameters(); ++p) {
    TF_ASSIGN_OR_RETURN(
        std::unique_ptr<xla::Literal> arg_literal,
        m_transfer_manager->TransferLiteralFromDevice(executor, *arguments[p]));
    arg_literals.push_back(std::move(arg_literal));
  }

  // Execute the graph using the HloEvaluator.
  xla::HloEvaluator evaluator;
  TF_ASSIGN_OR_RETURN(std::unique_ptr<xla::Literal> result_literal,
                      evaluator.Evaluate<std::unique_ptr<xla::Literal>>(
                          *computation, arg_literals));

What about TransferLiteralToDevice? (See the next slide.)
63. Plugin code: ExecuteOnStream
tensorflow/compiler/plugin/dynamic/example/executable.cc

  // Make sure that the result shape is not empty
  TF_RET_CHECK(!xla::ShapeUtil::IsNil(result_literal->shape()));

  TF_ASSIGN_OR_RETURN(std::unique_ptr<xla::ShapedBuffer> result,
                      m_transfer_manager->AllocateShapedBuffer(
                          result_literal->shape(), run_options->allocator(),
                          executor->device_ordinal()));

  TF_RETURN_IF_ERROR(m_transfer_manager->TransferLiteralToDevice(
      executor, *result_literal, *result));
  return std::move(result);
}

What about TransferLiteralFromDevice? (See the previous slide.)
90. reference : ops (ngraph-0.2.1/src/runtime/reference)
abs, acos, add, allreduce, asin, atan, avg_pool, broadcast, ceiling,
concat, constant, convert, convolution, copy, cos, cosh, divide, dot,
equal, exp, floor, greater, greater_eq, less, less_eq, log, max,
max_pool, maximum, min, minimum, multiply, negate, not, not_equal,
one_hot, pad, power, product, reduce, reduce_window, relu,
replace_slice, reshape, result, reverse, select, select_and_scatter,
sign, sin, sinh, slice, softmax, sqrt, subtract, sum, tan, tanh
91. Run MNIST Softmax with the activated bridge

def main(_):
    with tf.device('/device:NGRAPH:0'):
        run_mnist(_)

def run_mnist(_):
    # Import data
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
    ...

The nGraph backend defaults to "CPU"; it can be selected with the
XLA_NGRAPH_BACKEND environment variable (CPU / GPU / INTERPRETER).
Source: https://github.com/NervanaSystems/ngraph-tensorflow-bridge
93. Raspberry Pi 3
[Figure: block diagram of the Raspberry Pi 3 as a target for the dynamically loadable XLA plugin. Host side: Cortex-A53 x4 and DRAM; device side: the GPGPU block, a VideoCore IV (Broadcom), connected via the internal bus.]
Figure source: https://www.raspberrypi.org/products/raspberry-pi-3-model-b/
94. QMKL v1.0.0, 2018.04.10
https://github.com/Idein/qmkl
QMKL is a Math Kernel Library for VideoCore IV QPU.
QMKL is compatible with Intel MKL except for double precision etc.
We, Idein Inc., built object recognition demos (GoogLeNet etc.) on Raspberry Pi.
The demos run on QPU using both QMKL and our private libraries, which are
highly optimized for neural networks. Please check out our video on YouTube.
96. HiKey960
https://www.96boards.org/product/hikey960/
・HiSilicon Kirin 960
・ARM Cortex-A73 x4 + Cortex-A53 x4 + ARM Mali-G71 MP8
・3GB or 4GB LPDDR4 SDRAM
・32GB UFS flash storage
・WiFi (2.4 / 5 GHz) and Bluetooth 4.1
・1 x USB 2.0 Type C OTG
・2 x USB 3.0, 1 x USB 2.0 Type
・1 x HDMI 1.4 (Type A - full)
・12V @ 2A power jack (4.75mm outer / 1.7mm inner)
3GB version: $239
4GB version: ¥32,270 (tax included) at Switch Science