Using Boost.Interprocess for Large-Scale Data Management in the Statistical Analysis Language R

December 3, 2011
Boost#7
@sfchaos

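About me

• Twitter ID: @sfchaos
• Occupation: data analyst
• Data analysis for finance, healthcare, industry, and other fields, using R, C++, and similar tools
• Boost: used uBLAS a little for storing and computing with matrices back when I worked in finance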

Agenda

1. Why R?
2. What's R?
3. Large-scale data management in R with Boost.Interprocess
4. Other points of contact between C++/Boost and R
5. Summary

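1. Why R?
• Machine learning, natural language processing, data mining, and related fields have been booming in recent years
  "I keep saying that the sexy job in the next 10 years will be statisticians," said Hal Varian, chief economist at Google. "And I'm not kidding."

• R is a handy tool that lets analysts explore data interactively on their own machines
• The aim of this talk is to present a case study of Boost being used in R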

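2. What's R?

• A language and environment for statistical computing and graphics
• Provides a wide variety of statistical techniques (linear and nonlinear models, classical statistical tests, time-series analysis, discriminant analysis, clustering, and others) together with graphics
• Has attracted a great deal of attention in recent years
  • Study groups in many places (Tokyo.R, Tsukuba.R, Osaka.R, Hiroshima.R)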

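2.1 Strengths of R (a few examples)

• Objects are very easy to manipulate
> # Arithmetic sequence with first term 1, last term 10, and common difference 1
> x <- 1:10
> # Display the value of x
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> # Extract only the even terms
> x[x%%2==0]
[1]  2  4  6  8 10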

• Powerful graphics capabilities
[Figure: "Average Yearly Sunspots", two panels plotting sunspot counts ("spots", 0 to 150) against Year (1750 to 1950)]

• A rich collection of packages that provide the latest methods

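2.2 Weaknesses of R
• Basically uses only one CPU, even in a multi-core or multi-CPU environment
• Basically holds data in memory and runs computations in memory
• Because 32-bit integers are used for indexing, the number of elements in objects such as vectors, matrices, and arrays is capped at 2^31-1, even on a 64-bit OS
• Objects are basically passed by value, which consumes a lot of memory

Speeding up processing on large data takes some ingenuity
→ High Performance Computing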


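3.1 Going beyond the in-memory constraint
• With R's standard features alone, data size is limited by RAM
• Several packages are available that address this problem
• The bigmemory package uses Boost.Interprocess to manage data through shared memory and memory-mapped files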

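3.2 Overview of Boost.Interprocess
• A library that simplifies the use of interprocess communication and synchronization mechanisms
  • Shared memory
  • Memory-mapped files
  • Semaphores, mutexes, condition variables, and upgradable mutex types that can be placed in shared memory and memory-mapped files
  • Named versions of those synchronization objects, similar to the sem_open and CreateSemaphore APIs on Unix and Windows
  • File locking
  • Relative pointers
  • Message queues, and more
http://ohkuma.la.coocan.jp/tech/boost/Interprocess.html

• This talk takes only a brief look at shared memory and memory-mapped files, the parts that matter for R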

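3.3.1 Shared memory
• Header file to include
boost/interprocess/shared_memory_object.hpp

• Creating a shared memory segment
using namespace boost::interprocess;
// Open the shared memory segment, creating it if it does not exist yet
shared_memory_object shm_obj(open_or_create, "shared_memory", read_write);
// Set the size of the shared memory (requires read_write mode)
shm_obj.truncate(10000);

• Mapping the shared memory segment
using namespace boost::interprocess;
// mapped_region is declared in boost/interprocess/mapped_region.hpp
mapped_region region(shm_obj, read_write);

• Destroying the shared memory
using namespace boost::interprocess;
shared_memory_object::remove("shared_memory");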

3.3.2 Memory-mapped files
• Header file to include
boost/interprocess/file_mapping.hpp

• Creating a file mapping
file_mapping m_file("/usr/home/file", read_write);
• Mapping the contents of the file into memory
mapped_region region(m_file, read_write);
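To make the idea behind a memory-mapped file concrete, the sketch below does by hand what the two Boost.Interprocess calls above do on a POSIX system, where, as far as I understand, the library builds on open() and mmap(). It assumes a POSIX platform; the function name mmap_round_trip and the path passed to it are hypothetical. Data written through the mapping ends up in the file, which is what lets a file-backed matrix outlive the process that filled it.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstring>
#include <fstream>
#include <iterator>
#include <string>

// Creates `path`, sizes it, writes `text` through a shared mapping, and then
// reads the file back with ordinary stream I/O. The file is deleted again
// before returning. Returns an empty string on failure or for empty input.
std::string mmap_round_trip(const char* path, const std::string& text) {
    if (text.empty()) return "";

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return "";

    // The file must be at least as large as the region that gets mapped.
    if (ftruncate(fd, static_cast<off_t>(text.size())) != 0) {
        close(fd);
        return "";
    }

    void* addr = mmap(nullptr, text.size(), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return "";

    // Plain memory writes; MAP_SHARED carries them through to the file.
    std::memcpy(addr, text.data(), text.size());
    msync(addr, text.size(), MS_SYNC);
    munmap(addr, text.size());

    // Read the file back without any mapping to confirm the data is on disk.
    std::ifstream in(path, std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    in.close();
    unlink(path);
    return contents;
}
```

Boost.Interprocess wraps these steps in file_mapping and mapped_region and also covers Windows, so the library calls are the more portable choice in real code.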

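3.4 Classes that make up bigmemory
BigMatrix: abstract class for big matrices
  LocalBigMatrix: big matrix class that holds its data locally
  SharedBigMatrix: abstract class for shared big matrices
    SharedMemoryBigMatrix: big matrix class that uses shared memory
    FileBackedBigMatrix: big matrix class that uses a memory-mapped file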

3.4.1 Abstract class for shared big matrices
typedef boost::interprocess::mapped_region MappedRegion;
typedef boost::shared_ptr<MappedRegion> MappedRegionPtr;
typedef std::vector<MappedRegionPtr> MappedRegionPtrs;

class SharedBigMatrix : public BigMatrix
{
public:
  SharedBigMatrix() : BigMatrix() {_shared=true;}
  virtual ~SharedBigMatrix() {}
  std::string uuid() const {return _uuid;}
  std::string shared_name() const {return _sharedName;}

protected:
  virtual bool destroy()=0;
  bool create_uuid();  // create the UUID
  bool uuid(const std::string &uuid) {_uuid=uuid; return true;}
  std::string _uuid;
  std::string _sharedName;
  MappedRegionPtrs _dataRegionPtrs;
};

bool SharedBigMatrix::create_uuid()
{
  try {
    std::stringstream ss;
    boost::uuids::basic_random_generator<boost::mt19937> gen;
    boost::uuids::uuid u = gen();
    ss << u;
    _uuid = ss.str();
    return true;
  } catch(std::exception &e) {
    printf("%s\n", e.what());
    printf("%s line %d\n", __FILE__, __LINE__);
    return false;
  }
}

3.4.2 Big matrix class that uses shared memory
class SharedMemoryBigMatrix : public SharedBigMatrix
{
public:
  SharedMemoryBigMatrix():SharedBigMatrix(){};
  virtual ~SharedMemoryBigMatrix(){destroy();};
  // ① Create the big matrix
  virtual bool create( const index_type numRow, const index_type numCol,
      const int matrixType, const bool sepCols);
  // ② Connect to an existing big matrix
  virtual bool connect( const std::string &uuid, const index_type numRow,
      const index_type numCol, const int matrixType,
      const bool sepCols);
protected:
  // ③ Destroy the big matrix
  virtual bool destroy();

  SharedCounter _counter;
};

① Creating the big matrix
bool SharedMemoryBigMatrix::create( const index_type numRow,
    const index_type numCol, const int matrixType,
    const bool sepCols ) {
#ifndef INTERLOCKED_EXCHANGE_HACK
  named_mutex mutex(open_or_create,
      (_sharedName+"_counter_mutex").c_str());
  mutex.lock();
#endif
  // ①-1 Initialize the counter
  _counter.init( _sharedName+"_counter" );
#ifndef INTERLOCKED_EXCHANGE_HACK
  mutex.unlock();
#endif
  switch(_matType) {
  // Create the shared big matrix according to the matrix type
  case 1:
    // ①-2 Creation engine for shared big matrices
    _pdata = CreateSharedMatrix<char>(_sharedName,
        _dataRegionPtrs, _nrow, _ncol);
    break;
  ･･･
  }
  return true;
}

①-1 Initializing the counter
bool SharedCounter::init( const std::string &resourceName ) {
  _resourceName = resourceName;
  try {
    // First connection: create the counter segment
    boost::interprocess::shared_memory_object shm(
        boost::interprocess::create_only,
        _resourceName.c_str(),
        boost::interprocess::read_write);
    shm.truncate( sizeof(index_type) );
    _pRegion = new boost::interprocess::mapped_region(shm,
        boost::interprocess::read_write);
    _pVal = reinterpret_cast<index_type*>(_pRegion->get_address());
    *_pVal = 1;
  } catch(std::exception &ex) {
    // Connecting to a counter that already exists
    ･･･
    ++(*_pVal);
  }
  return true;
}

①-2 Creation engine for shared big matrices
template<typename T>
void* CreateSharedMatrix( const std::string &sharedName,
    MappedRegionPtrs &dataRegionPtrs, const index_type nrow,
    const index_type ncol)
{
  using namespace boost::interprocess;
  // Create the shared memory segment
  shared_memory_object shm(create_only, sharedName.c_str(), read_write);
  // Set the size of the shared memory (the size of the whole matrix)
  shm.truncate( nrow*ncol*sizeof(T) );
  dataRegionPtrs.push_back(
      MappedRegionPtr(new MappedRegion(shm, read_write)));
  return dataRegionPtrs[0]->get_address();
}

② Connecting to a big matrix
bool SharedMemoryBigMatrix::connect( const std::string &uuid,
    const index_type numRow, const index_type numCol,
    const int matrixType, const bool sepCols )
{
  // Attach to the associated mutex and counter;
#ifndef INTERLOCKED_EXCHANGE_HACK
  named_mutex mutex(open_or_create,
      (_sharedName+"_counter_mutex").c_str());
  mutex.lock();
#endif
  // ②-1 Initialize the counter (details omitted here; covered in ①-1)
  _counter.init( _sharedName+"_counter" );
#ifndef INTERLOCKED_EXCHANGE_HACK
  mutex.unlock();
#endif
  switch(_matType) {
  case 1:
    _pdata = ConnectSharedMatrix<char>(_sharedName,
        _dataRegionPtrs, _counter);
    break;
  ･･･
  }
  return true;
}

②-2 Connection engine for shared big matrices
template<typename T>
void* ConnectSharedMatrix( const std::string &sharedName,
    MappedRegionPtrs &dataRegionPtrs, SharedCounter &counter)
{
  using namespace boost::interprocess;
  // Open the existing shared memory segment
  shared_memory_object shm(open_only, sharedName.c_str(), read_write);
  // Add it to the list of mapped regions
  dataRegionPtrs.push_back(
      MappedRegionPtr(new MappedRegion(shm, read_write)));
  return reinterpret_cast<void*>(dataRegionPtrs[0]->get_address());
}

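③ Destroying the big matrix
bool SharedMemoryBigMatrix::destroy() {
#ifndef INTERLOCKED_EXCHANGE_HACK
  named_mutex mutex(open_or_create,
      (_sharedName+"_counter_mutex").c_str());
  mutex.lock();
#endif
  // Only the last process still attached removes the shared resources
  bool destroyThis = (1 == _counter.get());
  _dataRegionPtrs.resize(0);
  if (destroyThis) {
    shared_memory_object::remove(_uuid.c_str());
  }
#ifndef INTERLOCKED_EXCHANGE_HACK
  mutex.unlock();
  if (destroyThis) {
    named_mutex::remove((_sharedName+"_counter_mutex").c_str());
  }
#endif
  return true;
}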
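The counter logic spread over ①-1 and ③ boils down to reference counting on a named resource: each attach bumps a count, and whoever detaches while the count is 1 removes the resource. The sketch below models that rule inside a single process using only the standard library. ResourceRegistry and its member names are hypothetical, and the decrement on detach is an assumption, since the slides do not show where bigmemory decrements the counter.

```cpp
#include <map>
#include <string>

// In-process stand-in for named shared resources: name -> attach count.
// (Hypothetical; bigmemory keeps the count in a shared memory segment and
// guards it with a named mutex.)
class ResourceRegistry {
public:
    // Mirrors SharedCounter::init: the first attach sets the count to 1,
    // every later attach increments it. Returns the new count.
    long attach(const std::string& name) { return ++counts_[name]; }

    // Mirrors destroy(): the holder that sees a count of 1 removes the
    // resource; everyone else only gives up its own reference.
    // Returns true when this call removed the resource.
    bool detach(const std::string& name) {
        auto it = counts_.find(name);
        if (it == counts_.end()) return false;
        if (it->second == 1) {
            counts_.erase(it);
            return true;
        }
        --it->second;
        return false;
    }

    // True while at least one holder is still attached.
    bool exists(const std::string& name) const {
        return counts_.count(name) != 0;
    }

private:
    std::map<std::string, long> counts_;
};
```

With two attaches to the same name, the first detach leaves the resource in place and the second removes it, which corresponds to the destroyThis check in ③.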

3.4.3 Big matrix class that uses a memory-mapped file
class FileBackedBigMatrix : public SharedBigMatrix
{
public:
  FileBackedBigMatrix():SharedBigMatrix(){}
  virtual ~FileBackedBigMatrix(){destroy();}
  virtual bool create( const std::string &fileName,
      const std::string &filePath, const index_type numRow,
      const index_type numCol, const int matrixType,
      const bool sepCols);
  virtual bool connect( const std::string &fileName,
      const std::string &filePath, const index_type numRow,
      const index_type numCol, const int matrixType,
      const bool sepCols);
  std::string file_name() const {return _fileName;}
  bool flush();
protected:
  virtual bool destroy();

  std::string _fileName;
};

① Creating the big matrix
bool FileBackedBigMatrix::create( const std::string &fileName,
    const std::string &filePath, const index_type numRow,
    const index_type numCol, const int matrixType,
    const bool sepCols)
{
  // Create the memory-mapped file according to the matrix type
  switch(_matType) {
  case 1:
    // Creation engine for memory-mapped files
    _pdata = CreateFileBackedMatrix<char>(_fileName, filePath,
        _dataRegionPtrs, _nrow, _ncol);
    break;
  case 2:
    _pdata = CreateFileBackedMatrix<short>(_fileName, filePath,
        _dataRegionPtrs, _nrow, _ncol);
    break;
  ･･･
  }
  return true;
}

template<typename T>
void* ConnectFileBackedMatrix( const std::string &fileName,
    const std::string &filePath, MappedRegionPtrs &dataRegionPtrs)
{
  // Create the file mapping
  file_mapping mFile((filePath+"/"+fileName).c_str(), read_write);
  // Map the contents of the file into memory
  dataRegionPtrs.push_back(
      MappedRegionPtr(new MappedRegion(mFile, read_write)));
  return reinterpret_cast<void*>(dataRegionPtrs[0]->get_address());
}

② Connecting to a big matrix
bool FileBackedBigMatrix::connect( const std::string &fileName,
    const std::string &filePath, const index_type numRow,
    const index_type numCol, const int matrixType,
    const bool sepCols)
{
  // Connect to the memory-mapped file according to the matrix type
  switch(_matType) {
  case 1:
    _pdata = ConnectFileBackedMatrix<char>(_fileName, filePath,
        _dataRegionPtrs);
    break;
  case 2:
    _pdata = ConnectFileBackedMatrix<short>(_fileName, filePath,
        _dataRegionPtrs);
    break;
  ･･･
  }
  return true;
}

③ Destroying the big matrix
bool FileBackedBigMatrix::destroy()
{
  _dataRegionPtrs.resize(0);
  shared_memory_object::remove(_fileName.c_str());
  return true;
}

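3.3 How far R's problems are solved
Problem in R, followed by the degree of resolution (◎ very good, ○ good)
• Basically one CPU (core), even in a multi-CPU (multi-core) environment
  ○ Parallel and concurrent computation becomes possible through shared memory and memory-mapped files
• Basically holds data in memory and runs computations in memory
  ○ Data far larger than RAM can be handled
• The number of elements in vectors, matrices, arrays, and similar objects is capped at 2^31-1
  ○ The cap on the number of elements is raised to 2^52
• Objects basically cannot be passed by reference and are passed by value, so copies are made all over the place and consume memory
  ◎ Objects can be passed by reference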

3.4 Worked example
• Data used
  • Data Expo 2009
    Flight data for passenger aircraft in the United States (1987 to 2008)
    http://stat-computing.org/dataexpo/2009/the-data.html
  • About 12 GB (about 123 million records, 29 fields)
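A little arithmetic shows why this data set runs into the element-count cap discussed earlier. The sketch below assumes the approximate dimensions quoted above (123 million records by 29 fields) stored as one dense matrix; the function and constant names are illustrative only and are not part of R or bigmemory.

```cpp
#include <cstdint>
#include <limits>

// Element count of a dense rows x cols matrix, computed in 64 bits so the
// product itself cannot overflow for sizes of this order.
constexpr std::int64_t element_count(std::int64_t rows, std::int64_t cols) {
    return rows * cols;
}

// Largest count addressable with a signed 32-bit index: 2^31 - 1.
constexpr std::int64_t kInt32Limit = std::numeric_limits<std::int32_t>::max();

// The extended cap quoted in the talk: 2^52.
constexpr std::int64_t kExtendedLimit = std::int64_t{1} << 52;

constexpr bool fits_int32(std::int64_t n) { return n <= kInt32Limit; }
constexpr bool fits_extended(std::int64_t n) { return n <= kExtendedLimit; }

// Approximate size of the airline data: 123 million records x 29 fields.
constexpr std::int64_t kAirlineElements = element_count(123000000, 29);

static_assert(kAirlineElements == 3567000000LL, "123 million x 29");
static_assert(!fits_int32(kAirlineElements), "exceeds 2^31 - 1");
static_assert(fits_extended(kAirlineElements), "well below 2^52");
```

The product is 3,567,000,000 elements, roughly 1.66 times the 32-bit cap of 2,147,483,647, so the data cannot be held as a single ordinary R matrix regardless of how much RAM is installed.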

3.4.1 Creating and attaching to a memory-mapped file
> library(bigmemory)
> # Create the memory-mapped file (about 21 minutes on an Intel Core i7)
> airline <- read.big.matrix("AirlineAllData.csv",
+   header=TRUE, sep=",",
+   backingfile="AirlineAllData.bin",
+   descriptorfile="AirlineAllData.desc")

> # Attach to the memory-mapped file that has already been created (0.002 seconds)
> airline <- attach.big.matrix("AirlineAllData.desc")

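3.4.2 Tabulating the data
> library(bigtabulate)
>
> # Summary of each column (minimum, maximum, mean, number of NAs)
> summary(airline)
>
> # Number of flights by year and month
> bigtable(airline, c("Year", "Month"))
>
> # Statistics of arrival delay by day of the week
> # (minimum, maximum, mean, standard deviation, number of NAs)
> bigtsummary(airline, "DayOfWeek", cols="ArrDelay", na.rm=TRUE)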

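3.4.3 Estimating the month each aircraft was manufactured
> library(bigtabulate)
>
> # Record indices for each aircraft code (tail number)
> planeindices <- bigsplit(airline, 'TailNum')
>
> # Run in parallel on 2 cores
> library(doMC)
> registerDoMC(cores=2)
>
> # Estimate the month of manufacture (about 14 seconds)
> # birthmonth() is a user-defined function that is not shown on the slide
> planeStart <-
+   foreach(i=planeindices, .combine=c) %dopar% {
+     return(birthmonth(airline[i, c('Year','Month'),
+       drop=FALSE]))
+   }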

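• Storing and tabulating data is now possible, but the functionality is still far from sufficient
• To extend the functionality, a matrix that can hold only a single type is not enough
• A data frame that allows each column to have a different type needs to be developed
• I am looking into whether this can be built with Boost.Variant, Boost.MPL, and similar libraries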
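One way to picture a column-typed data frame is a list of columns in which each column is a variant over a few vector types. The sketch below is purely illustrative: it uses C++17 std::variant in place of Boost.Variant (where boost::variant and boost::apply_visitor play the same roles), the DataFrame type and its members are hypothetical, and the sample values are made up rather than taken from the airline data.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// A column holds values of exactly one type; different columns may differ.
using Column = std::variant<std::vector<int>,
                            std::vector<double>,
                            std::vector<std::string>>;

// Hypothetical minimal data frame: named columns of possibly different types.
struct DataFrame {
    std::vector<std::string> names;
    std::vector<Column> columns;

    // Appends a named column of any supported element type.
    void add(std::string name, Column col) {
        names.push_back(std::move(name));
        columns.push_back(std::move(col));
    }

    // Number of rows in column j, whatever its element type is.
    std::size_t nrow(std::size_t j) const {
        return std::visit([](const auto& v) { return v.size(); },
                          columns.at(j));
    }
};

// Builds a three-column frame with made-up values; the column names are
// borrowed from the airline example.
DataFrame make_example() {
    DataFrame df;
    df.add("Year", std::vector<int>{1987, 2008});
    df.add("ArrDelay", std::vector<double>{-3.0, 12.5});
    df.add("TailNum", std::vector<std::string>{"N101", "N202"});
    return df;
}
```

The visitor in nrow() is where the variant pays off: one generic lambda handles every column type, and the compiler rejects the call if an alternative is not covered. Placing such columns in shared memory or memory-mapped files is a separate problem that this sketch does not address.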

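4. Other points of contact between C++/Boost and R

• The Rcpp package lets the interface between R and C++ be written concisely (modeled on Boost.Python)
• The RBGL package calls the Boost.Graph library
• The RcppBDT package calls the Boost.Date_Time library, and there are more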

5. Summary

• Boost is used in many places to extend what R can do
• So that data analysts can keep working comfortably, I wish the Boost community ever greater success!
