SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Downloaden Sie, um offline zu lesen
Notes about moving
from Python to C++
Yung-Yu Chen (@yungyuc)

PyConTW 2020, Tainan, Taiwan
1
Why move?
• Numerical software: Computer programs to solve scientific or
mathematic problems.

• Other names: Mathematical software, scientific software, technical
software.

• Python is a popular language for application experts to describe the
problems and solutions, because it is easy to use.

• Most of the computing systems (the numerical software) are designed in
a hybrid architecture.

• The computing kernel uses C++.

• Python is chosen for the user-level API.
2
What I do
• By training, I am a computational scientist, focusing on
applications of continuum mechanics. (I am not a
computer scientist.)

• In my day job, I write high-performance code for
semiconductor applications of computational geometry
and photolithography.

• In my spare time, I am teaching a course ‘numerical
software development’ in the department of computer
science in NCTU.
3
twitter: https://twitter.com/yungyuc

linkedin: https://www.linkedin.com/in/yungyuc/.
Example: PDEs
4
Numerical simulations of
conservation laws:

∂u
∂t
+
3
∑
k=1
∂F(k)
(u)
∂xk
= 0
Use case: stress waves in 

anisotropic solids
Use case: compressible flows
Example: OPC
5
photoresist
silicon substrate
photomask
light source
Photolithography in semiconductor fabrication
wave length is only
hundreds of nm
image I want to
project on the PR
shape I need
on the mask
Optical proximity correction (OPC)
(smaller than the
wave length)
write code to
make it happen
Python: the good is the bad
• Dynamicity enables
convention over
compilation for productivity

• It is so when you have
good testing and
documentation
import math
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
@property
def length(self):
return math.hypot(self.x, self.y)
5.0print(Point(3, 4).length)
print(Point("3", "4").length) Traceback (most recent call last):
File "ambiguity.py", line 10, in <module>
print(Point("3", "4").length)
File "ambiguity.py", line 8, in length
return math.hypot(self.x, self.y)
TypeError: a float is required
6
What is fast?
worker(Data&, unsigned long, int):
mov DWORD PTR [rdi+rsi*4], edx
mov rax, rdi
ret
class Data {
int m_storage[1024];
public:
int & operator[] (size_t it) { return m_storage[it]; }
};
Data & worker(Data & data, size_t idx, int value) {
data[idx] = value;
return data;
}
High-level language
Machine code
7
Dynamic typing is slow
• Python trades runtime for convenience; everything is
indirect and incurs runtime overhead
def work(data, key, name, value):
# Assign value to the data as a container with key.
data[key] = value
# Assign value to the data as an object with name.
setattr(data, name, value)
return data
8
• The innocent-looking Python code uses many checking
functions under the hood
int PyObject_SetItem(PyObject *o, PyObject *key, PyObject *value)
{
PyMappingMethods *m;
if (o == NULL || key == NULL || value == NULL) {
null_error();
return -1;
}
m = o->ob_type->tp_as_mapping;
if (m && m->mp_ass_subscript)
return m->mp_ass_subscript(o, key, value);
if (o->ob_type->tp_as_sequence) {
if (PyIndex_Check(key)) {
Py_ssize_t key_value;
key_value = PyNumber_AsSsize_t(key, PyExc_IndexError);
if (key_value == -1 && PyErr_Occurred())
return -1;
return PySequence_SetItem(o, key_value, value);
}
else if (o->ob_type->tp_as_sequence->sq_ass_item) {
type_error(/* ... omit ... */);
return -1;
}
}
type_error(/* ... omit ... */);
return -1;
}
int PyObject_SetAttr(PyObject *v, PyObject *name, PyObject *value)
{
PyTypeObject *tp = Py_TYPE(v);
int err;
if (!PyUnicode_Check(name)) {
PyErr_Format(PyExc_TypeError, /* ... omit ... */);
return -1;
}
Py_INCREF(name);
PyUnicode_InternInPlace(&name);
if (tp->tp_setattro != NULL) {
err = (*tp->tp_setattro)(v, name, value);
Py_DECREF(name);
return err;
}
if (tp->tp_setattr != NULL) {
const char *name_str = PyUnicode_AsUTF8(name);
if (name_str == NULL) {
Py_DECREF(name);
return -1;
}
err = (*tp->tp_setattr)(v, (char *)name_str, value);
Py_DECREF(name);
return err;
}
Py_DECREF(name);
_PyObject_ASSERT(name, name->ob_refcnt >= 1);
if (tp->tp_getattr == NULL && tp->tp_getattro == NULL)
PyErr_Format(PyExc_TypeError, /* ... omit ... */);
else
PyErr_Format(PyExc_TypeError, /* ... omit ... */);
return -1;
}
data[key] = value setattr(data, name, value)
(in Objects/abstract.c) (in Objects/object.c)
static PyObject *__pyx_pf_6simple_work(
PyObject *__pyx_v_data, PyObject *__pyx_v_key
, PyObject *__pyx_v_name, PyObject *__pyx_v_value
) {
PyObject_SetItem(__pyx_v_data, __pyx_v_key, __pyx_v_value);
PyObject_SetAttr(__pyx_v_data, __pyx_v_name, __pyx_v_value);
return __pyx_v_data;
}
def work(data, key, name, value):
data[key] = value
setattr(data, name, value)
return data
mapping
sequence
attribute name check
check to set
9
Speed is the king
To pursue high-performance and flexibility simultaneously,
the systems need to use the best tools at the proper layers.
This is a path to develop such systems.
10
Python controls execution flow
C++ manages resources
Use C++ to generate fast assembly code
Glue Python and C++
PyObject uses malloc
PyObject_MALLOC() is a wrapper to the standard C memory
allocator malloc().
11
PyObject *
_PyObject_New(PyTypeObject *tp)
{
PyObject *op;
op = (PyObject *) PyObject_MALLOC(_PyObject_SIZE(tp));
if (op == NULL)
return PyErr_NoMemory();
return PyObject_INIT(op, tp);
}
Python call stack
• Python maintains its
own stack

• It’s different from that of
the ‘real machine’

• The two (C/C++ and
Python) call stacks work
very differently
12
#0 line 11: show_stack
#1 line 23: caller
#2 line 25: <module>: top-level module
local_val is: defined outside
import inspect
def show_stack():
# Print calling stacks.
f = inspect.currentframe()
i = 0
while f:
name = f.f_code.co_name
if '<module>' == name:
name += ': top-level module'
print("#%d line %d: %s" % (i, f.f_lineno, name))
f = f.f_back
i += 1
# Obtain a local variable defined in an outside stack.
f = inspect.currentframe().f_back
print('local_val is:', f.f_locals['local_var'])
del f
def caller():
local_var = "defined outside"
show_stack()
caller()
Best practice: glue
• For performance and compile-time checking, we do not want to use
Python

• For flow control and configuration, we do not want to use C++

• Glue tools:

• Python C API : https://docs.python.org/3/c-api/index.html

• Cython : https://cython.org/

• Boost.Python : https://www.boost.org/doc/libs/1_74_0/libs/python/
doc/html/index.html

• Pybind11 : https://github.com/pybind/pybind11
13
I only listed tools that require manual generation of wrapping code and compilation.
Pick pybind11
14
Tools Pros Cons
Python C API always available
manual reference count is hard to
get right
Cython easy to use no comprehensive C++ support
Boost.Python comprehensive C++ support requires boost
Pybind11
easy-to-use comprehensive
modern C++ support
require C++11
Build system: cmake
• Again, there are many tools for build systems

• Plain make

• Python distutils, setuptools, pip, etc.

• cmake

• cmake:

• Cross platform

• Consistent and up-to-date support
15
How cmake works
16
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <cmath>
std::vector<double> distance(
std::vector<double> const & x
, std::vector<double> const & y) {
std::vector<double> r(x.size());
for (size_t i = 0 ; i < x.size() ; ++i)
{
r[i] = std::hypot(x.at(i), y.at(i));
}
return r;
}
PYBIND11_MODULE(_pybmod, mod) {
namespace py = pybind11;
mod.doc() = "simple pybind11 module";
mod.def("distance", &distance);
}
cmake_minimum_required(VERSION 3.9)
project(pybmod)
find_package(pybind11 REQUIRED)
include_directories(${pybind11_INCLUDE_DIRS})
pybind11_add_module(
_pybmod
pybmod.cpp
)
target_link_libraries(_pybmod
PRIVATE ${MKL_LINKLINE})
set_target_properties(_pybmod
PROPERTIES CXX_VISIBILITY_PRESET "default")
add_custom_target(_pybmod_py
COMMAND ${CMAKE_COMMAND} -E
copy $<TARGET_FILE:_pybmod>
${PROJECT_SOURCE_DIR}
DEPENDS _pybmod)
$ mkdir -p build ; cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ make -C build _pybmod_py
$ python3 -c 'import _pybmod ; print(_pybmod.distance([3, 8], [4, 6]))'
[5.0, 10.0]
CMakeLists.txt pybmod.cpp
Zero-copy: do it where it fits
Python app C++ app
C++
container
Ndarray
manage
access
Python app C++ app
C++
container
Ndarray
manage
accessa11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn
memory buffer shared across language memory buffer shared across language
Top (Python) - down (C++) Bottom (C++) - up (Python)
Python app C++ app
a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn
memory buffer shared across language
Ndarray
C++
container
17
Not recom
m
ended
for performance
Share buffer from C++
Two handy approaches by pybind11
18
cls
.def_buffer
(
[](wrapped_type & self)
{
namespace py = pybind11;
return py::buffer_info
(
self.data() /* Pointer to buffer */
, sizeof(int8_t) /* Size of one scalar */
, py::format_descriptor<char>::format()
/* Python struct-style format descriptor */
, 1 /* Number of dimensions */
, { self.size() } /* Buffer dimensions */
, { 1 } /* Strides (in bytes) for each index */
);
}
)
cls
.def_property_readonly
(
"ndarray"
, [](wrapped_type & self)
{
namespace py = pybind11;
return py::array
(
py::detail::npy_format_descriptor<int8_t>::dtype()
/* Numpy dtype */
, { self.size() } /* Buffer dimensions */
, { 1 } /* Strides (in bytes) for each index */
, self.data() /* Pointer to buffer */
, py::cast(self.shared_from_this())
/* Owning Python object */
);
}
)
.def_property_readonly(“ndarray2”, /* ... */)
Return a distinct numpy ndarray:

• The C++ class can return multiple
buffers
Python buffer protocol:

• The C++ class itself is turned into an
array description
Check assembly
• Make sure compilers generate good code for
performance hotspot

• Checking instead of writing: we don’t
have time to use assembly for everything

• If compilers don’t do a good job, before
jumping into hand-written assembly, consider
to use intrinsics: https://software.intel.com/
sites/landingpage/IntrinsicsGuide/
19
vmovupd ymm0, ymmword [rsi + r9*8]
vmulpd ymm0, ymm0, ymm0
vmovupd ymm1, ymmword [rdx + r9*8]
vmulpd ymm1, ymm1, ymm1
vaddpd ymm0, ymm0, ymm1
vsqrtpd ymm0, ymm0
void calc_distance(
size_t const n
, double const * x
, double const * y
, double * r) {
for (size_t i = 0 ; i < n ; ++i)
{
r[i] = std::sqrt(x[i]*x[i] + y[i]*y[i]);
}
}
movupd xmm0, xmmword [rsi + r8*8]
mulpd xmm0, xmm0
movupd xmm1, xmmword [rdx + r8*8]
mulpd xmm1, xmm1
addpd xmm1, xmm0
sqrtpd xmm0, xmm1
-O3 -mavx2 -O3
# Use redare2 to print the assembly.
r2 -Aqc "e scr.color=0 ; s sym.calc_distance_unsignedlong_doubleconst__doubleconst__double ; pdf" 
_pybmod.cpython-37m-darwin.so
AVX: 256-bit-wide vectorization SSE: 128-bit-wide vectorization
Manage resources
• Python uses reference count to manage object lifecycle.
You can do the in C++ by using shared pointer.
20
• Reference counting involves a lock and is slow. You didn’t
experience it in Python because (i) other parts in Python
are even slower and (ii) you didn’t have a choice.
class ConcreteBuffer
: public std::enable_shared_from_this<ConcreteBuffer>
{
private:
struct ctor_passkey {};
public:
static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes)
{
return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey());
}
ConcreteBuffer(size_t nbytes, const ctor_passkey &)
: m_nbytes(nbytes)
, m_data(allocate(nbytes))
{}
};
Translate reference count
• If reference count is used in C++, make sure it is correctly
translated between Python.
21
pybind11::class_
<
ConcreteBuffer
, std::shared_ptr<ConcreteBuffer>
>(
mod, "ConcreteBuffer"
, "Contiguous memory buffer"
);
class ConcreteBuffer
: public std::enable_shared_from_this<ConcreteBuffer>
{
private:
struct ctor_passkey {};
public:
static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes)
{
return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey());
}
ConcreteBuffer(size_t nbytes, const ctor_passkey &)
: m_nbytes(nbytes)
, m_data(allocate(nbytes))
{}
};
• If reference count is used in C++, make sure it is correctly
translated between Python.
Run Python in C++
• When C++ code puts together the architecture, oftentimes
we want to keep Python code as little as possible.

• But Python is used for scripting and it’s necessary to
return Python objects for the scripts.

• Pybind11 allows to write concise C++. It’s much more
maintainable than in-line Python.
22
>>> joined = cppfunction()
>>> print(joined)
one, two, three
py::str cppfunction() {
py::list lst = py::list();
lst.append("one");
lst.append("two");
lst.append("three");
return py::str(", ").attr("join")(lst);
}
// No, don't do this!
PyRun_SimpleString("', '.join('one', 'two', 'three')");
Import and run
Pybind11 API is very Pythonic
23
>>> show_modules()
1101
3000
void show_modules() {
py::module sys = py::module::import("sys");
py::print(py::len(sys.attr("modules")));
py::print(sys.attr("getrecursionlimit")());
}
def show_modules():
import sys
print(len(sys.modules))
print(sys.getrecursionlimit())
Spend time in compilation
to save time in runtime
Compiler does a lot of good. Make use of it as much as
possible.
24
template < typename T >
class SimpleArray
{
public:
using value_type = T;
using shape_type = small_vector<size_t>;
// Determine content value element size during compile time.
static constexpr size_t ITEMSIZE = sizeof(value_type);
static constexpr size_t itemsize() { return ITEMSIZE; }
// Straight constructor.
explicit SimpleArray(size_t length)
: m_buffer(ConcreteBuffer::construct(length * ITEMSIZE))
, m_shape{length}
, m_stride{1}
{}
template< class InputIt > SimpleArray(InputIt first, InputIt last)
: SimpleArray(last-first)
{
// Let STL decide how to optimize memory copy.
std::copy(first, last, data());
}
};
Compiler reduces runtime
errors
• Wrap the wrappers: reduce duplicated code

• Do it in compile time to reduce runtime errors
25
template
<
class Wrapper
, class Wrapped
, class Holder = std::unique_ptr<Wrapped>
, class WrappedBase = Wrapped
>
class WrapBase
{
public:
using wrapper_type = Wrapper;
using wrapped_type = Wrapped;
using wrapped_base_type = WrappedBase;
using holder_type = Holder;
// Treat inheritance hierarchy.
using class_ = typename std::conditional_t
<
std::is_same< Wrapped, WrappedBase >::value
, pybind11::class_
< wrapped_type, holder_type >
, pybind11::class_
<wrapped_type, wrapped_base_type, holder_type>
>;
// Singleton.
static wrapper_type & commit(pybind11::module & mod)
{
static wrapper_type derived(mod);
return derived;
}
class_ & cls() { return m_cls; }
protected:
// Register through construction.
template <typename... Extra>
WrapBase(
pybind11::module & mod
, char const * pyname, char const * pydoc
, const Extra & ... extra
)
: m_cls(mod, pyname, pydoc, extra ...)
{}
private:
class_ m_cls;
};
Decouple resource
management from algorithms
26
• Fixed-size
contiguous data
buffer
template < typename T >
class SimpleArray
{
public:
using value_type = T;
using shape_type = small_vector<size_t>;
static constexpr size_t ITEMSIZE = sizeof(value_type);
explicit SimpleArray(size_t length)
: m_buffer(ConcreteBuffer::construct(length * ITEMSIZE))
, m_shape{length}
, m_stride{1}
{}
private:
std::shared_ptr<ConcreteBuffer> m_buffer;
shape_type m_shape;
shape_type m_stride;
};
class ConcreteBuffer
: public std::enable_shared_from_this<ConcreteBuffer>
{
private:
struct ctor_passkey {};
public:
static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes)
{ return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey()); }
ConcreteBuffer(size_t nbytes, const ctor_passkey &)
: m_nbytes(nbytes), m_data(allocate(nbytes))
{}
private:
using unique_ptr_type =
std::unique_ptr<int8_t, std::default_delete<int8_t[]>>;
size_t m_nbytes;
unique_ptr_type m_data;
};
• Descriptive data
object owning the
data buffer
Encapsulate complex calculation
without losing performance
27
// A solution element.
class Selm : public ElementBase<Selm>
{
public:
value_type dxneg() const { return x()-xneg(); }
value_type dxpos() const { return xpos()-x(); }
value_type xctr() const { return (xneg()+xpos())/2; }
value_type const & so0(size_t iv) const { return field().so0(xindex(), iv); }
value_type & so0(size_t iv) { return field().so0(xindex(), iv); }
value_type const & so1(size_t iv) const { return field().so1(xindex(), iv); }
value_type & so1(size_t iv) { return field().so1(xindex(), iv); }
value_type const & cfl() const { return field().cfl(xindex()); }
value_type & cfl() { return field().cfl(xindex()); }
value_type xn(size_t iv) const { return field().kernel().calc_xn(*this, iv); }
value_type xp(size_t iv) const { return field().kernel().calc_xp(*this, iv); }
value_type tn(size_t iv) const { return field().kernel().calc_tn(*this, iv); }
value_type tp(size_t iv) const { return field().kernel().calc_tp(*this, iv); }
value_type so0p(size_t iv) const { return field().kernel().calc_so0p(*this, iv); }
void update_cfl() { return field().kernel().update_cfl(*this); }
};
xj− 1
2
xs
j− 1
2
xj+ 1
2
xs
j+ 1
2
x+
j− 1
2
= x−
j+ 1
2
xj = xs
j
(h∗
)n
j− 1
2 ,+
· (0, −∆x+
j− 1
2
) (h∗
)n
j+ 1
2 ,−
· (0, −∆x−
j+ 1
2
)
h
n+ 1
2
j · (0, ∆xj)
(h∗
)n,+
j− 1
2
· (−∆t
2 , 0) (h∗
)n,+
j+ 1
2
· (∆t
2 , 0)
3 solution elements around a compound conservation element
(Part of a code for the space-time conservation element and solution element (CESE) method)
Conclusion
• Speed is the king for numerical calculation

• Use Python for configuration and C++ for number
crunching: glue them

• Keep in mind: Software engineering, zero-copy data
buffer, read assembly, manage resources, use compiler
for speed and safety

• More information: https://github.com/yungyuc/nsd
28
Thank you
29

Weitere ähnliche Inhalte

Was ist angesagt?

仮想化環境におけるパケットフォワーディング
仮想化環境におけるパケットフォワーディング仮想化環境におけるパケットフォワーディング
仮想化環境におけるパケットフォワーディング
Takuya ASADA
 
CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016]
IO Visor Project
 

Was ist angesagt? (20)

レシピの作り方入門
レシピの作り方入門レシピの作り方入門
レシピの作り方入門
 
BuildKitの概要と最近の機能
BuildKitの概要と最近の機能BuildKitの概要と最近の機能
BuildKitの概要と最近の機能
 
Nmapの真実(続)
Nmapの真実(続)Nmapの真実(続)
Nmapの真実(続)
 
20190625 OpenACC 講習会 第3部
20190625 OpenACC 講習会 第3部20190625 OpenACC 講習会 第3部
20190625 OpenACC 講習会 第3部
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
Threads and Callbacks for Embedded Python
Threads and Callbacks for Embedded PythonThreads and Callbacks for Embedded Python
Threads and Callbacks for Embedded Python
 
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語るKubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
 
きつねさんでもわかるLlvm読書会 第2回
きつねさんでもわかるLlvm読書会 第2回きつねさんでもわかるLlvm読書会 第2回
きつねさんでもわかるLlvm読書会 第2回
 
淺談探索 Linux 系統設計之道
淺談探索 Linux 系統設計之道 淺談探索 Linux 系統設計之道
淺談探索 Linux 系統設計之道
 
DockerでWordPressサイトを開発してみよう
DockerでWordPressサイトを開発してみようDockerでWordPressサイトを開発してみよう
DockerでWordPressサイトを開発してみよう
 
TMUX Rocks!
TMUX Rocks!TMUX Rocks!
TMUX Rocks!
 
仮想化環境におけるパケットフォワーディング
仮想化環境におけるパケットフォワーディング仮想化環境におけるパケットフォワーディング
仮想化環境におけるパケットフォワーディング
 
Enhancing Network and Runtime Security with Cilium and Tetragon by Raymond De...
Enhancing Network and Runtime Security with Cilium and Tetragon by Raymond De...Enhancing Network and Runtime Security with Cilium and Tetragon by Raymond De...
Enhancing Network and Runtime Security with Cilium and Tetragon by Raymond De...
 
DPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキングDPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキング
 
CentOS Linux 8 の EOL と対応策の検討
CentOS Linux 8 の EOL と対応策の検討CentOS Linux 8 の EOL と対応策の検討
CentOS Linux 8 の EOL と対応策の検討
 
JavaでCPUを使い倒す! ~Java 9 以降の CPU 最適化を覗いてみる~(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019...
JavaでCPUを使い倒す! ~Java 9 以降の CPU 最適化を覗いてみる~(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019...JavaでCPUを使い倒す! ~Java 9 以降の CPU 最適化を覗いてみる~(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019...
JavaでCPUを使い倒す! ~Java 9 以降の CPU 最適化を覗いてみる~(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019...
 
Linuxのsemaphoreとmutexを見る 
Linuxのsemaphoreとmutexを見る Linuxのsemaphoreとmutexを見る 
Linuxのsemaphoreとmutexを見る 
 
CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016]
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
 
Elixir入門「第1回:パターンマッチ&パイプでJSONパースアプリをサクっと書いてみる」
Elixir入門「第1回:パターンマッチ&パイプでJSONパースアプリをサクっと書いてみる」Elixir入門「第1回:パターンマッチ&パイプでJSONパースアプリをサクっと書いてみる」
Elixir入門「第1回:パターンマッチ&パイプでJSONパースアプリをサクっと書いてみる」
 

Ähnlich wie Notes about moving from python to c++ py contw 2020

An Overview Of Python With Functional Programming
An Overview Of Python With Functional ProgrammingAn Overview Of Python With Functional Programming
An Overview Of Python With Functional Programming
Adam Getchell
 
#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started
Hadziq Fabroyir
 
20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx
20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx
20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx
eugeniadean34240
 

Ähnlich wie Notes about moving from python to c++ py contw 2020 (20)

Cluj.py Meetup: Extending Python in C
Cluj.py Meetup: Extending Python in CCluj.py Meetup: Extending Python in C
Cluj.py Meetup: Extending Python in C
 
Object Oriented Technologies
Object Oriented TechnologiesObject Oriented Technologies
Object Oriented Technologies
 
C++ manual Report Full
C++ manual Report FullC++ manual Report Full
C++ manual Report Full
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
 
Pemrograman Python untuk Pemula
Pemrograman Python untuk PemulaPemrograman Python untuk Pemula
Pemrograman Python untuk Pemula
 
Unit 2function in python.pptx
Unit 2function in python.pptxUnit 2function in python.pptx
Unit 2function in python.pptx
 
C++ Programming
C++ ProgrammingC++ Programming
C++ Programming
 
An Overview Of Python With Functional Programming
An Overview Of Python With Functional ProgrammingAn Overview Of Python With Functional Programming
An Overview Of Python With Functional Programming
 
Lecture2.ppt
Lecture2.pptLecture2.ppt
Lecture2.ppt
 
C++ Programming
C++ ProgrammingC++ Programming
C++ Programming
 
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
 
Pres_python_talakhoury_26_09_2023.pdf
Pres_python_talakhoury_26_09_2023.pdfPres_python_talakhoury_26_09_2023.pdf
Pres_python_talakhoury_26_09_2023.pdf
 
#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started#OOP_D_ITS - 2nd - C++ Getting Started
#OOP_D_ITS - 2nd - C++ Getting Started
 
C Programming Tutorial - www.infomtec.com
C Programming Tutorial - www.infomtec.comC Programming Tutorial - www.infomtec.com
C Programming Tutorial - www.infomtec.com
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance Python
 
Pythonpresent
PythonpresentPythonpresent
Pythonpresent
 
TWINS: OOP and FP - Warburton
TWINS: OOP and FP - WarburtonTWINS: OOP and FP - Warburton
TWINS: OOP and FP - Warburton
 
Chapter 1 Class 12 Computer Science Unit 1
Chapter 1 Class 12 Computer Science Unit 1Chapter 1 Class 12 Computer Science Unit 1
Chapter 1 Class 12 Computer Science Unit 1
 
Revision of the basics of python1 (1).pdf
Revision of the basics of python1 (1).pdfRevision of the basics of python1 (1).pdf
Revision of the basics of python1 (1).pdf
 
20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx
20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx
20145-5SumII_CSC407_assign1.htmlCSC 407 Computer Systems II.docx
 

Mehr von Yung-Yu Chen

Mehr von Yung-Yu Chen (10)

Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for Speed
 
SimpleArray between Python and C++
SimpleArray between Python and C++SimpleArray between Python and C++
SimpleArray between Python and C++
 
Write code and find a job
Write code and find a jobWrite code and find a job
Write code and find a job
 
On the necessity and inapplicability of python
On the necessity and inapplicability of pythonOn the necessity and inapplicability of python
On the necessity and inapplicability of python
 
Take advantage of C++ from Python
Take advantage of C++ from PythonTake advantage of C++ from Python
Take advantage of C++ from Python
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New Rope
 
Harmonic Stack for Speed
Harmonic Stack for SpeedHarmonic Stack for Speed
Harmonic Stack for Speed
 
Your interactive computing
Your interactive computingYour interactive computing
Your interactive computing
 
Engineer Engineering Software
Engineer Engineering SoftwareEngineer Engineering Software
Engineer Engineering Software
 
Craftsmanship in Computational Work
Craftsmanship in Computational WorkCraftsmanship in Computational Work
Craftsmanship in Computational Work
 

Kürzlich hochgeladen

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 

Kürzlich hochgeladen (20)

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 

Notes about moving from python to c++ py contw 2020

  • 1. Notes about moving from Python to C++ Yung-Yu Chen (@yungyuc) PyConTW 2020, Tainan, Taiwan 1
  • 2. Why move? • Numerical software: Computer programs to solve scientific or mathematic problems. • Other names: Mathematical software, scientific software, technical software. • Python is a popular language for application experts to describe the problems and solutions, because it is easy to use. • Most of the computing systems (the numerical software) are designed in a hybrid architecture. • The computing kernel uses C++. • Python is chosen for the user-level API. 2
  • 3. What I do • By training, I am a computational scientist, focusing on applications of continuum mechanics. (I am not a computer scientist.) • In my day job, I write high-performance code for semiconductor applications of computational geometry and photolithography. • In my spare time, I am teaching a course ‘numerical software development’ in the department of computer science in NCTU. 3 twitter: https://twitter.com/yungyuc
 linkedin: https://www.linkedin.com/in/yungyuc/.
  • 4. Example: PDEs 4 Numerical simulations of conservation laws: ∂u ∂t + 3 ∑ k=1 ∂F(k) (u) ∂xk = 0 Use case: stress waves in 
 anisotropic solids Use case: compressible flows
  • 5. Example: OPC 5 photoresist silicon substrate photomask light source Photolithography in semiconductor fabrication wave length is only hundreds of nm image I want to project on the PR shape I need on the mask Optical proximity correction (OPC) (smaller than the wave length) write code to make it happen
  • 6. Python: the good is the bad • Dynamicity enables convention over compilation for productivity • It is so when you have good testing and documentation import math class Point: def __init__(self, x, y): self.x = x self.y = y @property def length(self): return math.hypot(self.x, self.y) 5.0print(Point(3, 4).length) print(Point("3", "4").length) Traceback (most recent call last): File "ambiguity.py", line 10, in <module> print(Point("3", "4").length) File "ambiguity.py", line 8, in length return math.hypot(self.x, self.y) TypeError: a float is required 6
  • 7. What is fast? worker(Data&, unsigned long, int): mov DWORD PTR [rdi+rsi*4], edx mov rax, rdi ret class Data { int m_storage[1024]; public: int & operator[] (size_t it) { return m_storage[it]; } }; Data & worker(Data & data, size_t idx, int value) { data[idx] = value; return data; } High-level language Machine code 7
  • 8. Dynamic typing is slow • Python trades runtime for convenience; everything is indirect and incurs runtime overhead def work(data, key, name, value): # Assign value to the data as a container with key. data[key] = value # Assign value to the data as an object with name. setattr(data, name, value) return data 8 • The innocent-looking Python code uses many checking functions under the hood
  • 9. int PyObject_SetItem(PyObject *o, PyObject *key, PyObject *value) { PyMappingMethods *m; if (o == NULL || key == NULL || value == NULL) { null_error(); return -1; } m = o->ob_type->tp_as_mapping; if (m && m->mp_ass_subscript) return m->mp_ass_subscript(o, key, value); if (o->ob_type->tp_as_sequence) { if (PyIndex_Check(key)) { Py_ssize_t key_value; key_value = PyNumber_AsSsize_t(key, PyExc_IndexError); if (key_value == -1 && PyErr_Occurred()) return -1; return PySequence_SetItem(o, key_value, value); } else if (o->ob_type->tp_as_sequence->sq_ass_item) { type_error(/* ... omit ... */); return -1; } } type_error(/* ... omit ... */); return -1; } int PyObject_SetAttr(PyObject *v, PyObject *name, PyObject *value) { PyTypeObject *tp = Py_TYPE(v); int err; if (!PyUnicode_Check(name)) { PyErr_Format(PyExc_TypeError, /* ... omit ... */); return -1; } Py_INCREF(name); PyUnicode_InternInPlace(&name); if (tp->tp_setattro != NULL) { err = (*tp->tp_setattro)(v, name, value); Py_DECREF(name); return err; } if (tp->tp_setattr != NULL) { const char *name_str = PyUnicode_AsUTF8(name); if (name_str == NULL) { Py_DECREF(name); return -1; } err = (*tp->tp_setattr)(v, (char *)name_str, value); Py_DECREF(name); return err; } Py_DECREF(name); _PyObject_ASSERT(name, name->ob_refcnt >= 1); if (tp->tp_getattr == NULL && tp->tp_getattro == NULL) PyErr_Format(PyExc_TypeError, /* ... omit ... */); else PyErr_Format(PyExc_TypeError, /* ... omit ... */); return -1; } data[key] = value setattr(data, name, value) (in Objects/abstract.c) (in Objects/object.c) static PyObject *__pyx_pf_6simple_work( PyObject *__pyx_v_data, PyObject *__pyx_v_key , PyObject *__pyx_v_name, PyObject *__pyx_v_value ) { PyObject_SetItem(__pyx_v_data, __pyx_v_key, __pyx_v_value); PyObject_SetAttr(__pyx_v_data, __pyx_v_name, __pyx_v_value); return __pyx_v_data; } def work(data, key, name, value): data[key] = value setattr(data, name, value) return data mapping sequence attribute name check check to set 9
  • 10. Speed is the king To pursue high-performance and flexibility simultaneously, the systems need to use the best tools at the proper layers. This is a path to develop such systems. 10 Python controls execution flow C++ manages resources Use C++ to generate fast assembly code Glue Python and C++
  • 11. PyObject uses malloc PyObject_MALLOC() is a wrapper to the standard C memory allocator malloc(). 11 PyObject * _PyObject_New(PyTypeObject *tp) { PyObject *op; op = (PyObject *) PyObject_MALLOC(_PyObject_SIZE(tp)); if (op == NULL) return PyErr_NoMemory(); return PyObject_INIT(op, tp); }
  • 12. Python call stack • Python maintains its own stack • It’s different from that of the ‘real machine’ • The two (C/C++ and Python) call stacks work very differently 12 #0 line 11: show_stack #1 line 23: caller #2 line 25: <module>: top-level module local_val is: defined outside import inspect def show_stack(): # Print calling stacks. f = inspect.currentframe() i = 0 while f: name = f.f_code.co_name if '<module>' == name: name += ': top-level module' print("#%d line %d: %s" % (i, f.f_lineno, name)) f = f.f_back i += 1 # Obtain a local variable defined in an outside stack. f = inspect.currentframe().f_back print('local_val is:', f.f_locals['local_var']) del f def caller(): local_var = "defined outside" show_stack() caller()
  • 13. Best practice: glue • For performance and compile-time checking, we do not want to use Python • For flow control and configuration, we do not want to use C++ • Glue tools: • Python C API : https://docs.python.org/3/c-api/index.html • Cython : https://cython.org/ • Boost.Python : https://www.boost.org/doc/libs/1_74_0/libs/python/ doc/html/index.html • Pybind11 : https://github.com/pybind/pybind11 13 I only listed tools that require manual generation of wrapping code and compilation.
  • 14. Pick pybind11 14 Tools Pros Cons Python C API always available manual reference count is hard to get right Cython easy to use no comprehensive C++ support Boost.Python comprehensive C++ support requires boost Pybind11 easy-to-use comprehensive modern C++ support require C++11
  • 15. Build system: cmake • Again, there are many tools for build systems • Plain make • Python distutils, setuptools, pip, etc. • cmake • cmake: • Cross platform • Consistent and up-to-date support 15
  • 16. How cmake works 16 #include <pybind11/pybind11.h> #include <pybind11/stl.h> #include <cmath> std::vector<double> distance( std::vector<double> const & x , std::vector<double> const & y) { std::vector<double> r(x.size()); for (size_t i = 0 ; i < x.size() ; ++i) { r[i] = std::hypot(x.at(i), y.at(i)); } return r; } PYBIND11_MODULE(_pybmod, mod) { namespace py = pybind11; mod.doc() = "simple pybind11 module"; mod.def("distance", &distance); } cmake_minimum_required(VERSION 3.9) project(pybmod) find_package(pybind11 REQUIRED) include_directories(${pybind11_INCLUDE_DIRS}) pybind11_add_module( _pybmod pybmod.cpp ) target_link_libraries(_pybmod PRIVATE ${MKL_LINKLINE}) set_target_properties(_pybmod PROPERTIES CXX_VISIBILITY_PRESET "default") add_custom_target(_pybmod_py COMMAND ${CMAKE_COMMAND} -E copy $<TARGET_FILE:_pybmod> ${PROJECT_SOURCE_DIR} DEPENDS _pybmod) $ mkdir -p build ; cd build $ cmake .. -DCMAKE_BUILD_TYPE=Release $ make -C build _pybmod_py $ python3 -c 'import _pybmod ; print(_pybmod.distance([3, 8], [4, 6]))' [5.0, 10.0] CMakeLists.txt pybmod.cpp
  • 17. Zero-copy: do it where it fits Python app C++ app C++ container Ndarray manage access Python app C++ app C++ container Ndarray manage accessa11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn memory buffer shared across language memory buffer shared across language Top (Python) - down (C++) Bottom (C++) - up (Python) Python app C++ app a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn memory buffer shared across language Ndarray C++ container 17 Not recom m ended for performance
  • 18. Share buffer from C++ Two handy approaches by pybind11 18 cls .def_buffer ( [](wrapped_type & self) { namespace py = pybind11; return py::buffer_info ( self.data() /* Pointer to buffer */ , sizeof(int8_t) /* Size of one scalar */ , py::format_descriptor<char>::format() /* Python struct-style format descriptor */ , 1 /* Number of dimensions */ , { self.size() } /* Buffer dimensions */ , { 1 } /* Strides (in bytes) for each index */ ); } ) cls .def_property_readonly ( "ndarray" , [](wrapped_type & self) { namespace py = pybind11; return py::array ( py::detail::npy_format_descriptor<int8_t>::dtype() /* Numpy dtype */ , { self.size() } /* Buffer dimensions */ , { 1 } /* Strides (in bytes) for each index */ , self.data() /* Pointer to buffer */ , py::cast(self.shared_from_this()) /* Owning Python object */ ); } ) .def_property_readonly(“ndarray2”, /* ... */) Return a distinct numpy ndarray: • The C++ class can return multiple buffers Python buffer protocol: • The C++ class itself is turned into an array description
  • 19. Check assembly • Make sure compilers generate good code for performance hotspot • Checking instead of writing: we don’t have time to use assembly for everything • If compilers don’t do a good job, before jumping into hand-written assembly, consider to use intrinsics: https://software.intel.com/ sites/landingpage/IntrinsicsGuide/ 19 vmovupd ymm0, ymmword [rsi + r9*8] vmulpd ymm0, ymm0, ymm0 vmovupd ymm1, ymmword [rdx + r9*8] vmulpd ymm1, ymm1, ymm1 vaddpd ymm0, ymm0, ymm1 vsqrtpd ymm0, ymm0 void calc_distance( size_t const n , double const * x , double const * y , double * r) { for (size_t i = 0 ; i < n ; ++i) { r[i] = std::sqrt(x[i]*x[i] + y[i]*y[i]); } } movupd xmm0, xmmword [rsi + r8*8] mulpd xmm0, xmm0 movupd xmm1, xmmword [rdx + r8*8] mulpd xmm1, xmm1 addpd xmm1, xmm0 sqrtpd xmm0, xmm1 -O3 -mavx2 -O3 # Use redare2 to print the assembly. r2 -Aqc "e scr.color=0 ; s sym.calc_distance_unsignedlong_doubleconst__doubleconst__double ; pdf" _pybmod.cpython-37m-darwin.so AVX: 256-bit-wide vectorization SSE: 128-bit-wide vectorization
  • 20. Manage resources • Python uses reference count to manage object lifecycle. You can do the in C++ by using shared pointer. 20 • Reference counting involves a lock and is slow. You didn’t experience it in Python because (i) other parts in Python are even slower and (ii) you didn’t have a choice. class ConcreteBuffer : public std::enable_shared_from_this<ConcreteBuffer> { private: struct ctor_passkey {}; public: static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes) { return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey()); } ConcreteBuffer(size_t nbytes, const ctor_passkey &) : m_nbytes(nbytes) , m_data(allocate(nbytes)) {} };
  • 21. Translate reference count • If reference count is used in C++, make sure it is correctly translated between Python. 21 pybind11::class_ < ConcreteBuffer , std::shared_ptr<ConcreteBuffer> >( mod, "ConcreteBuffer" , "Contiguous memory buffer" ); class ConcreteBuffer : public std::enable_shared_from_this<ConcreteBuffer> { private: struct ctor_passkey {}; public: static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes) { return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey()); } ConcreteBuffer(size_t nbytes, const ctor_passkey &) : m_nbytes(nbytes) , m_data(allocate(nbytes)) {} }; • If reference count is used in C++, make sure it is correctly translated between Python.
  • 22. Run Python in C++ • When C++ code puts together the architecture, oftentimes we want to keep Python code as little as possible. • But Python is used for scripting and it’s necessary to return Python objects for the scripts. • Pybind11 allows to write concise C++. It’s much more maintainable than in-line Python. 22 >>> joined = cppfunction() >>> print(joined) one, two, three py::str cppfunction() { py::list lst = py::list(); lst.append("one"); lst.append("two"); lst.append("three"); return py::str(", ").attr("join")(lst); } // No, don't do this! PyRun_SimpleString("', '.join('one', 'two', 'three')");
  • 23. Import and run Pybind11 API is very Pythonic 23 >>> show_modules() 1101 3000 void show_modules() { py::module sys = py::module::import("sys"); py::print(py::len(sys.attr("modules"))); py::print(sys.attr("getrecursionlimit")()); } def show_modules(): import sys print(len(sys.modules)) print(sys.getrecursionlimit())
  • 24. Spend time in compilation to save time in runtime Compiler does a lot of good. Make use of it as much as possible. 24 template < typename T > class SimpleArray { public: using value_type = T; using shape_type = small_vector<size_t>; // Determine content value element size during compile time. static constexpr size_t ITEMSIZE = sizeof(value_type); static constexpr size_t itemsize() { return ITEMSIZE; } // Straight constructor. explicit SimpleArray(size_t length) : m_buffer(ConcreteBuffer::construct(length * ITEMSIZE)) , m_shape{length} , m_stride{1} {} template< class InputIt > SimpleArray(InputIt first, InputIt last) : SimpleArray(last-first) { // Let STL decide how to optimize memory copy. std::copy(first, last, data()); } };
  • 25. Compiler reduces runtime errors • Wrap the wrappers: reduce duplicated code • Do it in compile time to reduce runtime errors 25 template < class Wrapper , class Wrapped , class Holder = std::unique_ptr<Wrapped> , class WrappedBase = Wrapped > class WrapBase { public: using wrapper_type = Wrapper; using wrapped_type = Wrapped; using wrapped_base_type = WrappedBase; using holder_type = Holder; // Treat inheritance hierarchy. using class_ = typename std::conditional_t < std::is_same< Wrapped, WrappedBase >::value , pybind11::class_ < wrapped_type, holder_type > , pybind11::class_ <wrapped_type, wrapped_base_type, holder_type> >; // Singleton. static wrapper_type & commit(pybind11::module & mod) { static wrapper_type derived(mod); return derived; } class_ & cls() { return m_cls; } protected: // Register through construction. template <typename... Extra> WrapBase( pybind11::module & mod , char const * pyname, char const * pydoc , const Extra & ... extra ) : m_cls(mod, pyname, pydoc, extra ...) {} private: class_ m_cls; };
  • 26. Decouple resource management from algorithms 26 • Fixed-size contiguous data buffer template < typename T > class SimpleArray { public: using value_type = T; using shape_type = small_vector<size_t>; static constexpr size_t ITEMSIZE = sizeof(value_type); explicit SimpleArray(size_t length) : m_buffer(ConcreteBuffer::construct(length * ITEMSIZE)) , m_shape{length} , m_stride{1} {} private: std::shared_ptr<ConcreteBuffer> m_buffer; shape_type m_shape; shape_type m_stride; }; class ConcreteBuffer : public std::enable_shared_from_this<ConcreteBuffer> { private: struct ctor_passkey {}; public: static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes) { return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey()); } ConcreteBuffer(size_t nbytes, const ctor_passkey &) : m_nbytes(nbytes), m_data(allocate(nbytes)) {} private: using unique_ptr_type = std::unique_ptr<int8_t, std::default_delete<int8_t[]>>; size_t m_nbytes; unique_ptr_type m_data; }; • Descriptive data object owning the data buffer
  • 27. Encapsulate complex calculation without losing performance 27 // A solution element. class Selm : public ElementBase<Selm> { public: value_type dxneg() const { return x()-xneg(); } value_type dxpos() const { return xpos()-x(); } value_type xctr() const { return (xneg()+xpos())/2; } value_type const & so0(size_t iv) const { return field().so0(xindex(), iv); } value_type & so0(size_t iv) { return field().so0(xindex(), iv); } value_type const & so1(size_t iv) const { return field().so1(xindex(), iv); } value_type & so1(size_t iv) { return field().so1(xindex(), iv); } value_type const & cfl() const { return field().cfl(xindex()); } value_type & cfl() { return field().cfl(xindex()); } value_type xn(size_t iv) const { return field().kernel().calc_xn(*this, iv); } value_type xp(size_t iv) const { return field().kernel().calc_xp(*this, iv); } value_type tn(size_t iv) const { return field().kernel().calc_tn(*this, iv); } value_type tp(size_t iv) const { return field().kernel().calc_tp(*this, iv); } value_type so0p(size_t iv) const { return field().kernel().calc_so0p(*this, iv); } void update_cfl() { return field().kernel().update_cfl(*this); } }; xj− 1 2 xs j− 1 2 xj+ 1 2 xs j+ 1 2 x+ j− 1 2 = x− j+ 1 2 xj = xs j (h∗ )n j− 1 2 ,+ · (0, −∆x+ j− 1 2 ) (h∗ )n j+ 1 2 ,− · (0, −∆x− j+ 1 2 ) h n+ 1 2 j · (0, ∆xj) (h∗ )n,+ j− 1 2 · (−∆t 2 , 0) (h∗ )n,+ j+ 1 2 · (∆t 2 , 0) 3 solution elements around a compound conservation element (Part of a code for the space-time conservation element and solution element (CESE) method)
  • 28. Conclusion • Speed is the king for numerical calculation • Use Python for configuration and C++ for number crunching: glue them • Keep in mind: Software engineering, zero-copy data buffer, read assembly, manage resources, use compiler for speed and safety • More information: https://github.com/yungyuc/nsd 28