76. I collect my data at different locations all over the world, but want to do statistical analysis at my headquarters or at one other location. How would I do that?
77. Aggregate your data and push it out to the cloud, e.g. once a day. That's a sort of replication, if you like
78. Choose a cloud-based data store that can store big objects. The store should provide consistency characteristics similar to those of your local data store
79. Try AWS (S3), Rackspace (OpenStack/Swift) or a private cloud. They are either directly Dynamo-based or implement similar concepts
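The daily push described above can be sketched in Python. The key scheme, field names and region label below are illustrative assumptions, not part of any store's API; the resulting (key, body) pair is what would be handed to an S3-like big-object store:

```python
import json
from collections import Counter
from datetime import date

def aggregate_day(votes):
    """Collapse one location's raw votes into a compact daily summary."""
    totals = Counter()
    for v in votes:
        totals[v["candidate"]] += 1
    return dict(totals)

def daily_object(region, day, votes):
    """Build the (key, body) pair to push to the cloud store once a day.
    The region/date key layout is an assumption for illustration."""
    key = f"{region}/{day.isoformat()}/aggregate.json"
    body = json.dumps(aggregate_day(votes), sort_keys=True)
    return key, body

# One region's day of votes becomes a single big object
votes = [{"candidate": "A"}, {"candidate": "B"}, {"candidate": "A"}]
key, body = daily_object("eu-west", date(2011, 11, 8), votes)
# key  -> "eu-west/2011-11-08/aggregate.json"
# body -> '{"A": 2, "B": 1}'
```

Pushing one aggregated object per region and day keeps the write rate against the cloud store low, which is exactly the replication-style pattern the slide describes.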
80. That’s a lot of data and distribution. I need to quickly push it from a location into the cloud while data keeps coming in. How would I do that?
81. Use MapReduce to distribute the aggregation job across a group of nodes, in order to quickly complete the overall aggregation and cloud storage
82. Map, sort, combine and reduce to whatever representation you need
83. Separate the MapReduce splitting, jobs and intermediate storage from the local store to keep them independent, so you can read local store snapshots while still writing new data
84. Try Hadoop. It implements MapReduce with its own file system (HDFS), distribution etc. It is highly extensible
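The map / sort / combine / reduce steps above can be sketched in miniature Python (this is the MapReduce pattern itself, not Hadoop's Java API; the vote records and two-split setup are illustrative assumptions):

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Emit one (key, count) pair per vote
    return [(r["candidate"], 1) for r in records]

def sort_phase(pairs):
    # Group identical keys together, as the shuffle/sort step does
    return sorted(pairs, key=itemgetter(0))

def combine_or_reduce(sorted_pairs):
    # Sum counts per key; a combiner does this per node, the reducer globally
    return {k: sum(c for _, c in grp)
            for k, grp in groupby(sorted_pairs, key=itemgetter(0))}

# Two input splits, processed independently (as on two worker nodes)
split1 = [{"candidate": "A"}, {"candidate": "B"}]
split2 = [{"candidate": "A"}]
partial1 = combine_or_reduce(sort_phase(map_phase(split1)))
partial2 = combine_or_reduce(sort_phase(map_phase(split2)))

# The final reduce merges the per-node partials
total = dict(Counter(partial1) + Counter(partial2))
# total -> {"A": 2, "B": 1}
```

Because each split is mapped and combined independently, new data can keep arriving in the local store while a snapshot is being aggregated, which is the separation the previous slide asks for.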
85. I need to do some statistical analysis and visualize my data. How would I do that?
86. Choose a general purpose platform for statistical computing and graphics
87. Try R. It allows statistical analysis of any type of data, and graphical plotting of the results. It’s highly extensible
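The talk recommends R for this step; purely to illustrate the kind of summary statistics meant here, a minimal sketch with Python's standard `statistics` module (the per-day vote totals are made-up sample data):

```python
import statistics

# Hypothetical aggregated vote totals per day, as pulled from the cloud store
daily_counts = [120, 135, 128, 150, 142, 138, 160]

summary = {
    "mean": statistics.mean(daily_counts),
    "median": statistics.median(daily_counts),
    "stdev": round(statistics.pstdev(daily_counts), 2),
}
# summary -> {'mean': 139, 'median': 138, 'stdev': 12.36}
```

In R the same summary is a one-liner (`summary(daily_counts)`), and plotting comes built in, which is why the slide recommends it.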
88. And these are only some of the possible Q&As. There are more areas, such as NoSQL, content preparation for CDNs, data mining etc., which we didn’t consider
89. Know your data, your scenarios, how to scale the technology and when to stop
90. The experiment – live demo. Source code will be available at http://github.com/pavlobaron . A detailed description will be available at http://archi-jab-ture.blogspot.com
93. We store votes as they come in, with a sloppy write quorum. We store on several nodes in a regional cluster. We match patterns on a stream, not on saved data. We push aggregated daily data from the regions to the cloud using distributed MapReduce. We use scalable, distributed components with HA options. Etc. Does it scale?
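One way to sketch the "match patterns on a stream, not on saved data" idea is a sliding-window detector that keeps only the last few samples in memory. The window size, threshold factor and vote-rate numbers below are illustrative assumptions:

```python
from collections import deque

def spike_detector(window_size=5, factor=3.0):
    """Flag a vote-rate sample that exceeds `factor` times the
    average of the last `window_size` samples. Only the window is
    kept in memory; nothing is read back from saved data."""
    window = deque(maxlen=window_size)

    def feed(sample):
        spike = (len(window) == window_size
                 and sample > factor * (sum(window) / window_size))
        window.append(sample)
        return spike

    return feed

feed = spike_detector()
rates = [10, 12, 11, 9, 13, 40]   # votes/second; the last sample is a burst
flags = [feed(r) for r in rates]
# flags -> [False, False, False, False, False, True]
```

Because the detector sees each sample exactly once, it works at ingestion speed regardless of how much data has already been stored, which is the point of matching on the stream rather than on the store.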