Secured and Efficient Data Scheduling of Intermediate Data Sets
in Cloud
D.TEJASWINI.,M.TECH
C.RAJENDRA.,M.TECH.,M.E.,PH.D.
AUDISANKARA COLLEGE OF ENGINEERING & TECHNOLOGY
ABSTRACT
Cloud computing is an emerging field in business and organizational
environments. Because it provides greater computation power and storage space, users
can process many applications, and these generate a large number of intermediate data sets.
Encryption and decryption techniques are used to preserve the privacy of intermediate
data sets in the cloud. An upper bound constraint-based approach is used to identify sensitive
intermediate data sets, and we apply a suppression technique to the sensitive data sets in order to
reduce time and cost. The Value Generalization Hierarchy protocol is used to strengthen
security so that a number of users can access the data with privacy. In addition,
Optimized Balanced Scheduling is used to find the best mapping solution, one that balances
the system load to the greatest extent or reduces the load balancing cost.
Privacy preservation is also ensured under dynamic data size and access
frequency values. Storage space and computational requirements are optimally utilized in the
privacy preservation process. Data distribution complexity is also handled in the scheduling
process.
Keywords: Cloud computing, privacy upper bound, intermediate data sets, optimized
balanced scheduling, value generalization hierarchy protocol.
1. INTRODUCTION
Cloud computing relies on sharing
resources to achieve coherence and
economies of scale, similar to a utility
delivered over a network. The foundation
of cloud computing is the broader concept of
converged infrastructure and shared
services. The cloud focuses on
maximizing the effectiveness of shared
resources. Cloud resources are not only
shared by multiple users but also
dynamically reallocated on demand. The
privacy issues [12] caused by retaining
intermediate data sets in cloud are
important, but they have received little
attention. To preserve the privacy [9] of
multiple data sets, we should anonymize
all data sets first and then encrypt them
before storing or sharing them in cloud.
Usually, the volume of intermediate
data sets [11] is huge. Users will store only
INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
ISBN: 378 - 26 - 138420 - 5
www.iaetsd.in
important data sets on cloud when
processing original data sets in
data-intensive applications such as medical
diagnosis [16], in order to reduce
overall expenses by avoiding frequent
re-computation of these data sets. Such
practice is quite common because data
users often re-analyse results, conduct new
analyses on intermediate data sets, and
share some intermediate results with others
for collaboration.
Data Provenance is employed to manage
the intermediate datasets. A number of
tools for capturing provenance have been
developed in workflow systems and a
standard for provenance representation
called the Open Provenance Model (OPM)
has been designed.
2. RELATED WORK
Encryption is usually integrated with other
methods to achieve cost reduction, high
data usability and privacy protection. Roy
et al. [9] investigated the data privacy
problem caused by MapReduce and
presented a system named Airavat, which
incorporates mandatory access control
with differential privacy. Puttaswamy et al.
[8] described a set of tools called
Silverline, which identifies all encryptable
data and then encrypts them to protect
privacy. Encrypted data on the cloud
prevent privacy leakage to compromised or
malicious clouds, while users can easily
access data by decrypting it locally with
keys from a trusted organization. Using
dynamic program analysis techniques,
Silverline automatically identifies the
application data that can be
safely encrypted without negatively
affecting application functionality. By
modifying the application runtime, e.g. the
PHP interpreter, the authors show how Silverline
can determine an optimal assignment of
encryption keys that minimizes key
management overhead and the impact of key
compromise. Applications running on
the cloud can thereby protect their data from
security breaches or compromises in the
cloud, and the authors describe this work as a
significant first step. Zhang et
al. [5] proposed a system named Sedic,
which partitions MapReduce computing
jobs according to the security labels of the data
they work on and then assigns the
computation without sensitive data to a
public cloud. The sensitivity of data must
be labelled in advance for
the above approaches to be applicable. Ciriani et
al. [10] proposed an approach that
combines encryption and data
fragmentation to achieve privacy
protection for distributed data storage while
encrypting only part of the data sets.
3. SYSTEM ARCHITECTURE
Fig 1: System Architecture For Secure
Transaction Using The Cloud
Our approach works by
automatically identifying the subsets of an
application’s data that are not directly used
in computation, and exposing them to the
cloud only in encrypted form.
• We present a technique to partition
encrypted data into parts that are accessed
by different sets of users (groups).
Intelligent key assignment limits the
damage possible from a given key
compromise, and strikes a good trade-off
between robustness and key management
complexity.
• We present a technique that enables
clients to store and use their keys safely
while preventing the cloud-based service from
stealing the keys. Our solution works
today on unmodified web browsers.
Intermediate data sets give rise to many
privacy threats, so we need to
encrypt these data sets to keep them
private and secure.
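The group-based key assignment described in the first point above can be
sketched as follows. This is a minimal illustration under stated assumptions,
not the actual Silverline implementation: the function name assign_keys, the
item and user names, and the choice of random 32-byte keys are all
hypothetical. Data items accessed by exactly the same set of users share one
key, so a compromised key exposes only the items of that group.

```python
import secrets
from collections import defaultdict

def assign_keys(access_map):
    """Group data items by the exact set of users that access them.

    access_map: dict mapping item name -> iterable of user names
                (hypothetical input format).
    Returns (item_to_group, group_keys): each group is a frozenset of
    users, and each group receives one random 32-byte key.
    """
    groups = defaultdict(list)
    for item, users in access_map.items():
        groups[frozenset(users)].append(item)

    # One key per distinct access group, not one key per item.
    group_keys = {group: secrets.token_bytes(32) for group in groups}
    item_to_group = {
        item: group for group, items in groups.items() for item in items
    }
    return item_to_group, group_keys

# Illustrative data only.
access_map = {
    "report_a": ["alice", "bob"],
    "report_b": ["bob", "alice"],
    "notes_c": ["carol"],
}
item_to_group, group_keys = assign_keys(access_map)
```

Keying by access group rather than by item keeps the number of keys small,
which is the trade-off between robustness and key management complexity
mentioned above.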
Fig.2: A Scenario Showing Privacy
Threats Due To Intermediate Datasets
4. IMPLEMENTATION
4.1 Requirements
The problem of managing the intermediate
data generated during dataflow
computations deserves deeper study as a
first-class problem. There are
two major requirements that any effective
intermediate storage system needs to
satisfy: availability of intermediate data,
and minimal interference with the foreground
network traffic generated by the dataflow
computation.
Data Availability: A task in a
dataflow stage cannot be executed if its
intermediate input data is unavailable. A
system that provides higher availability for
intermediate data will suffer fewer
delays from re-executing tasks in case of
failure. In multi-stage computations, high
availability is critical because it minimizes the
effect of cascaded re-execution.
Minimal Interference: At the same time,
data availability cannot be pursued
over-aggressively. In particular,
intermediate data is used immediately, so
there is high network contention from the
foreground traffic that transfers intermediate data
to the next stage. An
intermediate data management system
therefore needs to minimize interference with this traffic.
4.2 Privacy Preserved Data Scheduling
Scheme
Here, multiple intermediate data set privacy
models are combined with the data
scheduling mechanism. Privacy
preservation is ensured under dynamic
data size and access frequency values.
Storage space and
computational requirements are optimally
utilized in the privacy preservation process,
and data distribution complexity is
handled in the scheduling process.
Data sensitivity is considered in the
intermediate data security process.
Resource requirement levels are monitored
and controlled by the security operations.
The system is divided into five major
modules: data center, data
provider, intermediate data privacy,
security analysis and data scheduling. The
data center maintains the encrypted data
values for the providers. The data provider
module manages the shared data uploading
process. The intermediate data
privacy module protects
intermediate results. The security analysis
module estimates the
resource and access levels. The data scheduling
module plans the distribution of original
and intermediate data.
Dynamic privacy management and the
scheduling mechanism are integrated to
improve data sharing with security.
The joint verification mechanism reduces the
privacy preserving cost.
4.3 Analysis of the Cost Problem
Cloud services provide various pricing
models to support the pay-as-you-go
model, e.g., the Amazon Web Services pricing
model [4]. The privacy-preserving cost of
intermediate data sets stems
from frequent encryption or decryption
with charged cloud services, which requires
computation power, data storage, and
other cloud services. To avoid the pricing
details and to focus on the core ideas, we combine the prices
of the various services required by encryption
or decryption into one.
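With a single combined price, the privacy-preserving cost can be illustrated
by a small calculation. The sketch below assumes, as a simplification that
this section does not spell out, that each encrypted data set contributes its
size times its access frequency times the combined unit price. The names
DataSet and privacy_cost and all sample figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    size_gb: float     # data set size (assumed unit: GB)
    frequency: float   # accesses per unit time
    encrypted: bool    # True if the data set is kept encrypted

def privacy_cost(datasets, unit_price):
    """Sum size * frequency * price over the encrypted data sets only.

    Unencrypted data sets incur no encryption or decryption charge
    in this simplified model.
    """
    return sum(
        d.size_gb * d.frequency * unit_price
        for d in datasets
        if d.encrypted
    )

# Illustrative figures only.
datasets = [
    DataSet("d1", 10.0, 2.0, True),
    DataSet("d2", 5.0, 4.0, False),
    DataSet("d3", 1.0, 10.0, True),
]
cost = privacy_cost(datasets, unit_price=0.5)
```

Because size and access frequency both enter the product, a small but
frequently accessed data set can cost as much to protect as a large, rarely
accessed one.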
4.4 Proposed Framework
The new protocol that
we use for privacy protection is the
Value Generalization Hierarchy (VGH) protocol,
which assigns
common values to the unknown
and original data values for general
identification. We then add full
suppression on the more important data
sets, which reinforces the complete
encryption of the given data sets. Our aim is to
investigate privacy-aware and efficient
scheduling of intermediate data sets for
minimum cost and fast computation.
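The idea of assigning common values through a value generalization hierarchy
can be sketched as below. This is only an illustration of the general
technique, not the protocol as implemented by the authors: the hierarchy
contents, the dictionary name VGH and the function generalize are
hypothetical.

```python
# Hypothetical hierarchy: each value maps to its more general parent.
VGH = {
    "flu": "respiratory illness",
    "bronchitis": "respiratory illness",
    "gastritis": "digestive illness",
    "respiratory illness": "illness",
    "digestive illness": "illness",
}

def generalize(value, levels=1, hierarchy=VGH):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in hierarchy:
            break  # already at the most general value
        value = hierarchy[value]
    return value

records = ["flu", "bronchitis", "gastritis"]
level1 = [generalize(v, 1) for v in records]
level2 = [generalize(v, 2) for v in records]
```

At level 1 the first two records become indistinguishable, and at level 2 all
three share the same common value, so a higher level gives more privacy at
the price of less detail.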
Suppression of data is done to reduce the
overall computation time and cost, and
the VGH protocol is proposed for the
same purpose. The more
important data sets are secured through semi suppression
only. Full suppression achieves
high privacy for the original data
sets, and the original data set can be
viewed only by its owner. In this way a number of users can
access the data securely and
privacy leakage is avoided. The privacy protection
cost is reduced by
using an upper bound
constraint-based approach to select the
necessary subset of intermediate data sets that needs to be encoded.
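The selection step can be sketched as a small constrained search. The sketch
below makes two simplifying assumptions that are not taken from the text:
the leakage of the unencoded data sets is treated as additive, and the
search is exhaustive, which is only practical for a handful of data sets.
The function name select_to_encrypt, the leakage and cost figures and the
threshold epsilon are hypothetical.

```python
from itertools import combinations

def select_to_encrypt(datasets, epsilon):
    """Pick the cheapest subset to encode under a leakage upper bound.

    datasets: dict name -> (leakage, cost), both non-negative.
    epsilon:  upper bound on the summed leakage of the data sets
              that are left unencoded (assumed additive).
    Returns (names_to_encode, total_cost).
    """
    names = sorted(datasets)
    best = None
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            chosen = set(subset)
            leak = sum(datasets[n][0] for n in names if n not in chosen)
            if leak > epsilon:
                continue  # violates the privacy upper bound
            cost = sum(datasets[n][1] for n in chosen)
            if best is None or cost < best[1]:
                best = (chosen, cost)
    return best

# Illustrative figures only: name -> (leakage, cost).
datasets = {"d1": (0.5, 8.0), "d2": (0.3, 2.0), "d3": (0.1, 1.0)}
sensitive, total_cost = select_to_encrypt(datasets, epsilon=0.45)
non_sensitive = set(datasets) - sensitive
```

Encoding every data set always satisfies the bound, so a solution exists for
any non-negative epsilon; the search simply looks for a cheaper one.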
The privacy concerns caused by retaining
intermediate data sets in cloud are
important. Storage and computation
services in cloud are equivalent from an
economic perspective because they are
charged in proportion to their usage.
Existing technical approaches for
preserving the privacy of data sets stored in
cloud mainly include encryption and
anonymization. On one hand, encrypting
all data sets, a straightforward and
effective approach, is widely adopted in
current research. However, processing
encrypted data sets efficiently is quite a
challenging task, because most existing
applications only run on unencrypted data
sets. Thus, for preserving the privacy of
multiple data sets, it is promising to
anonymize all data sets first and then
encrypt them before storing or sharing
them in cloud. Usually, the volume of
intermediate data sets is huge. Data sets
are divided into two sets: the sensitive
intermediate data sets, denoted SD, and the
non-sensitive intermediate data sets,
denoted NSD. The
equations SD ∪ NSD = D and SD ∩ NSD = ∅
hold, where D is the set of all intermediate
data sets. The pair (SD, NSD) is regarded as a
global privacy-preserving solution for cloud data.
Suppression is applied only to
sensitive data sets, in two ways:
semi suppression and full suppression.
Full suppression is applied to the most
important sensitive intermediate data sets,
so that each data set value is fully
encoded. Semi suppression is applied to
selected sensitive data sets, so that half of
each data set value is encoded. We also
propose the Value Generalization Hierarchy
(VGH) protocol to reduce the cost of data
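The difference between the two suppression modes described in this section
can be illustrated with a short sketch. It assumes that encoding a value
means masking its characters and that semi suppression masks the second half
of the value. Both choices, and the function names full_suppress and
semi_suppress, are hypothetical.

```python
def full_suppress(value, mask="*"):
    """Encode the whole value."""
    return mask * len(value)

def semi_suppress(value, mask="*"):
    """Keep the first half of the value and encode the rest."""
    keep = len(value) // 2
    return value[:keep] + mask * (len(value) - keep)

# Illustrative values only.
postcode = "560034"
semi = semi_suppress(postcode)
full = full_suppress(postcode)
```

For the sample value, semi suppression yields "560***" and full suppression
yields "******"; the length of the value is preserved in both cases.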
4.5 Optimized Balanced Scheduling
Optimized balanced scheduling is used
to find the best mapping solution, one that
balances the system load to the greatest extent
or lowers the load balancing cost. The
best scheduling solution for the current
scheduling process can be found through a
genetic algorithm. First we
compute the cost as the ratio of the
current scheduling solution to the best
scheduling solution, and then we
choose the best scheduling strategy
according to that cost, so that scheduling has the
least influence on the system load
and reaches load balance at the lowest
cost. In this way, we
can form the best scheduling strategy.
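A genetic search for a balanced mapping can be sketched as follows. This is a
generic illustration rather than the authors' algorithm: it uses the gap
between the most and least loaded node as the cost instead of the ratio
described above, and the function names, the population size, the mutation
rate and the sample task loads are all assumptions. At least two tasks are
required.

```python
import random

def imbalance(mapping, task_loads, n_nodes):
    """Cost of a mapping: gap between the most and least loaded node."""
    loads = [0.0] * n_nodes
    for task, node in enumerate(mapping):
        loads[node] += task_loads[task]
    return max(loads) - min(loads)

def genetic_schedule(task_loads, n_nodes, pop_size=30, generations=100, seed=0):
    """Search for a task-to-node mapping with low load imbalance."""
    rng = random.Random(seed)
    n_tasks = len(task_loads)

    def cost(mapping):
        return imbalance(mapping, task_loads, n_nodes)

    # Each individual assigns every task to a random node.
    population = [
        [rng.randrange(n_nodes) for _ in range(n_tasks)]
        for _ in range(pop_size)
    ]
    initial_best = min(cost(m) for m in population)

    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]  # elitism: keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)      # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:               # mutation: move one task
                child[rng.randrange(n_tasks)] = rng.randrange(n_nodes)
            children.append(child)
        population = survivors + children

    best = min(population, key=cost)
    return best, cost(best), initial_best

# Illustrative task loads only.
task_loads = [5, 3, 8, 2, 7, 4, 6, 1]
best, best_cost, initial_best = genetic_schedule(task_loads, n_nodes=3)
```

Because the better half of each generation survives unchanged, the best cost
never gets worse from one generation to the next.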
5. CONCLUSION
In this paper, the focus is mainly
on identifying where the most sensitive
intermediate data sets are present in cloud.
An upper bound constraint-based approach
is used to select the data sets that need to be
encoded, in order to reduce the privacy
preserving cost. We investigate privacy-aware,
efficient scheduling of intermediate
data sets in cloud by taking privacy
preservation as a metric together with other
metrics such as storage and computation.
Optimized balanced scheduling strategies
are expected to be developed toward
highly efficient privacy-aware data
set scheduling, mainly for overall
time reduction, and data delivery overhead
is reduced by the load balancing based
scheduling mechanism. A dynamic privacy
preservation model is supported by the
system, and high security
is provided with the help of full
suppression, semi suppression and the Value
Generalization Hierarchy protocol. This
protocol is used to assign a common
attribute to different attributes, and
resource consumption is controlled
with the support of the sensitive data
information graph.
6. REFERENCES
[1] L. Wang, J. Zhan, W. Shi, and Y.
Liang, “In Cloud, Can Scientific
Communities Benefit from the Economies
of Scale?,” IEEE Trans. Parallel and
Distributed Systems, vol. 23, no. 2, pp.
296-303, Feb. 2012.
[2] X. Zhang, C. Liu, S. Nepal, S. Pandey,
and J. Chen, “A Privacy Leakage Upper
Bound Constraint-Based Approach for
Cost-Effective Privacy Preserving of
Intermediate Data Sets in Cloud,” IEEE
Trans. Parallel and Distributed Systems,
vol. 24, no. 6, June 2013.
[3] D. Zissis and D. Lekkas, “Addressing
Cloud Computing Security Issues,” Future
Generation Computer Systems, vol. 28, no.
3, pp. 583- 592, 2011.
[4] D. Yuan, Y. Yang, X. Liu, and J. Chen,
“On-Demand Minimum Cost
Benchmarking for Intermediate Data Set
Storage in Scientific Cloud Workflow
Systems,” J. Parallel Distributed
Computing, vol. 71, no. 2, pp. 316-332,
2011.
[5] K. Zhang, X. Zhou, Y. Chen, X. Wang,
and Y. Ruan, “Sedic: Privacy-Aware Data
Intensive Computing on Hybrid Clouds,”
Proc. 18th ACM Conf.
Computer and Comm. Security (CCS ’11),
pp. 515-526, 2011.
[6] H. Lin and W. Tzeng, “A Secure
Erasure Code-Based Cloud Storage
System with Secure Data Forwarding,”
IEEE Trans. Parallel and Distributed
Systems, vol. 23, no. 6, pp. 995-1003, June
2012.
[7] G. Wang, Z. Zutao, D. Wenliang, and
T. Zhouxuan, “Inference Analysis in
Privacy-Preserving Data Re-Publishing,”
Proc. Eighth IEEE Int’l Conf.
Data Mining (ICDM ’08), pp. 1079-1084,
2008.
[8] K.P.N. Puttaswamy, C. Kruegel, and
B.Y. Zhao, “Silverline: Toward Data
Confidentiality in Storage-Intensive Cloud
Applications,” Proc. Second
ACM Symp. Cloud Computing (SoCC
’11), 2011.
[9] I. Roy, S.T.V. Setty, A. Kilzer, V.
Shmatikov, and E. Witchel, “Airavat:
Security and Privacy for Mapreduce,”
Proc. Seventh USENIX Conf.
Networked Systems Design and
Implementation (NSDI ’10), p. 20, 2010.
[10] X. Zhang, C. Liu, J. Chen, and W.
Dou, “An Upper-Bound Control Approach
for Cost-Effective Privacy Protection of
Intermediate Data Set Storage
in Cloud,” Proc. Ninth IEEE Int’l Conf.
Dependable, Autonomic and Secure
Computing (DASC ’11), pp. 518-525,
2011.
[11] B.C.M. Fung, K. Wang, R. Chen, and
P.S. Yu, “Privacy-Preserving Data
Publishing: A Survey of Recent
Developments,” ACM Computing Surveys,
vol. 42, no. 4, pp. 1-53, 2010.
[12] H. Lin and W. Tzeng, “A Secure
Erasure Code-Based Cloud Storage
System with Secure Data Forwarding,”
IEEE Trans. Parallel and Distributed
Systems, vol. 23, no. 6, pp. 995-1003, June
2012.