Project I – BigData




CMPE 226 – Database Design



       Submitted by

     Manjula Kollipara

     Roopa Penmetsa

    Sumalatha Elliadka

     Sridhar Srigiriraju



      Project Advisor

      Prof. John Gash




       October, 2012
Abstract
The goal of this project is to understand the ORM and JDBC methodologies and to explore design
considerations for managing big data.

Introduction
Our project handles huge data sets and measures the performance of loading and querying them. The
design considerations are implemented using both ORM and JDBC frameworks. To make the performance
results comparable, we kept the implementation design and resources as similar as possible.
The data set for this project contains weather details downloaded from the internet. The data has three
files, but we decided to ignore one of them because it is a subset of another file. From here on, the main
files are referred to as Station and Forecast. Though the Station data is mostly
constant, the Forecast data grows dynamically.

Implementation
In this project, we downloaded the weather data from the internet using a shell script and collected
data for 4 days. Approximately 2 GB of data was collected and loaded into the database using file
reading mechanisms. The following table provides a general overview of the implementation
details.
                   Implementation Details

                   Files                     DB Table names
                   mesowest_csv.tbl          Station
                   mesowest.out              Forecast

                   Number of Tuples
                   Station                   ~3.4 M
                   Forecast                  ~3.5 M

                   System Set-up
                   Processor                 2.9 GHz Intel Core i7
                   Memory                    8 GB

                   Tools & Technology
                   IDE                       Eclipse
                   Tools                     ORM, JDBC
                   Programming Languages     Java, XML, SQL
                   Testing Strategies        Unit test case, HQL, Criteria



Performance checks
        Loading data:
        The script downloads the data into separate folders named with the corresponding timestamp.
        All the projects load data from these folders and store it in the database. This way we
        could compare the load time of all the designs.


        Insert/update/retrieve operations on data:
        These operations are encountered while loading the data and can be significantly affected
        by the database design aspects like relationships and constraints.


        Search or find operation on data:
        The test queries focus mainly on the speed of fetching data based on a few search criteria.


        SQL vs HQL:
        By implementing queries in test cases using SQL/HQL, we tried to compare the
        performance based on the query language specific to JDBC and Hibernate.



Design/Approaches considered:

   1.    Denormalized Data
           To compare the load performance
           Performed basic clean-up: removed redundant and invalid data
           Station and Forecast data is loaded from the folders into the database as is
           No relationships in this design
           Station and Forecast tables do not hold constraints or integrity rules on any columns




   Observations


   JDBC
      Provides access to the PostgreSQL COPY command, which lets the user bulk-load data from a file into the database.
      If no integrity rules are adopted, the COPY command breaks and throws an exception.
      In this project, the base version inserts tuple by tuple into the database.
      This method requires hitting the database every time an insert is performed.


   Hibernate
      ORM allows traversing the data object by object.
      The developer can control the number of hits to the database (batch updates are possible).
      Does not provide an option to load all 3.x M records into the database at once.
      In this project, data is loaded in batches of 100 records at a time.
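
The difference between tuple-by-tuple inserts and batches of 100 can be sketched as follows. This is a minimal illustration, not the project's code: BatchLoader and its sink are hypothetical names, and the sink stands in for one database round trip (for example PreparedStatement.addBatch()/executeBatch() in JDBC, or a Session flush() followed by clear() in Hibernate).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers records and hands them to a sink in fixed-size batches (hypothetical helper). */
final class BatchLoader<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // stands in for one database round trip
    private final List<T> buffer = new ArrayList<>();
    private int roundTrips = 0;

    BatchLoader(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one record and flushes when the buffer reaches batchSize. */
    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) flush();
    }

    /** Sends any buffered records as one batch. Call once after the last add. */
    void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(new ArrayList<>(buffer)); // copy so the sink may keep the list
        buffer.clear();
        roundTrips++;
    }

    int roundTrips() { return roundTrips; }

    public static void main(String[] args) {
        BatchLoader<String> loader = new BatchLoader<>(100, batch -> { /* write batch here */ });
        for (int i = 0; i < 250; i++) loader.add("row-" + i);
        loader.flush();
        System.out.println(loader.roundTrips()); // 3 round trips instead of 250
    }
}
```

With a batch size of 1 the same class degenerates to one round trip per record, which is the behaviour described for the JDBC base version.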

2. Normalized data
      To analyze the effects of normalization on huge data
      Removed redundant MNET, SLAT, SLON, SELV columns from the Forecast table
      Assumption: the Station primaryId is unique across all the Stations.
Observations


JDBC:
   Introduces StationId to avoid duplicate Station information in the database.
   Normalized the Forecast table because it is dynamic.
   Implemented a foreign key in Forecast referring to Station. For each Forecast entry, the
   corresponding ‘StationId’ has to be looked up from Station.
   Implemented a stored procedure for the ‘SaveOrUpdate’ strategy. It also
   ensures that referential constraints are not violated.
   Implementation is comparatively simple.
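
The ‘SaveOrUpdate’ strategy amounts to "try an update, insert only if nothing was updated". The sketch below shows that control flow over an in-memory map standing in for the Station table. It is an assumption-laden illustration, not the project's stored procedure: StationStore and its methods are hypothetical, and a real version would issue UPDATE and INSERT statements instead of map operations.

```java
import java.util.HashMap;
import java.util.Map;

/** Update-else-insert over an in-memory stand-in for the Station table (hypothetical). */
final class StationStore {
    // primaryId -> station name; stands in for rows of the Station table
    private final Map<String, String> rows = new HashMap<>();

    /** Mimics UPDATE ... WHERE primaryId = ?: returns the number of rows changed. */
    int update(String primaryId, String name) {
        if (!rows.containsKey(primaryId)) return 0;
        rows.put(primaryId, name);
        return 1;
    }

    /** Mimics INSERT: rejects a duplicate key, as a primary-key constraint would. */
    void insert(String primaryId, String name) {
        if (rows.putIfAbsent(primaryId, name) != null) {
            throw new IllegalStateException("duplicate key: " + primaryId);
        }
    }

    /** Tries the update first and inserts only when no row was changed. */
    void saveOrUpdate(String primaryId, String name) {
        if (update(primaryId, name) == 0) insert(primaryId, name);
    }

    String nameOf(String primaryId) { return rows.get(primaryId); }

    int size() { return rows.size(); }

    public static void main(String[] args) {
        StationStore store = new StationStore();
        store.saveOrUpdate("S1", "first name");  // no row yet, so this inserts
        store.saveOrUpdate("S1", "second name"); // row exists, so this updates
        System.out.println(store.size() + " " + store.nameOf("S1"));
    }
}
```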


Hibernate:
   Implemented OneToMany between the Station and Forecast tables. The data is loaded into
   both tables simultaneously by associating the Station object with Forecast objects at
   runtime. This approach involved a lot of file searching time and resulted in
   performance degradation.
   All the fields are stored as strings; a better approach would be to use
   appropriate data types for the double and date fields.
   The developer had control over database access.
   Implementation is complex, since Hibernate does a lot of work in the background.
   For example, extra logic needs to be implemented to reattach a detached
   object to the current session.
3. Indexing
       To analyze the search query performance in test cases
       To analyze the load performance when columns are indexed
       Indexed columns frequently used in the test cases. In this project we implemented
       indexing on the temperature field in the Forecast table.
       Indexes created implicitly on primary keys by PostgreSQL are not counted towards the
       difference in performance, since they exist in the non-indexed version too.
       This method is implemented only for the Hibernate version.


   Observation
   Indexing resulted in extra loading time, but search queries were faster. However, we tested this
   design on a separate system, so the results are not directly comparable with the other designs.
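
The trade-off between load time and search time can be illustrated with an in-memory analogy. This is a sketch, not the project's Hibernate mapping: ForecastTemps is a hypothetical class in which a list plays the role of the Forecast table and a sorted TreeMap plays the role of the index on temperature. Every insert pays for one extra index update, while a range query can read only the matching keys instead of scanning every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory analogy of a table with a sorted index on temperature (hypothetical). */
final class ForecastTemps {
    private final List<Double> rows = new ArrayList<>();                     // the "table"
    private final TreeMap<Double, List<Integer>> index = new TreeMap<>();    // the "index"

    /** Appends a row and maintains the index: the extra work paid at load time. */
    void insert(double temperature) {
        rows.add(temperature);
        index.computeIfAbsent(temperature, t -> new ArrayList<>()).add(rows.size() - 1);
    }

    /** Range count without an index: every row is inspected. */
    int countByScan(double low, double high) {
        int n = 0;
        for (double t : rows) {
            if (t >= low && t <= high) n++;
        }
        return n;
    }

    /** Range count with the index: only keys inside [low, high] are visited. */
    int countByIndex(double low, double high) {
        int n = 0;
        for (List<Integer> rowIds : index.subMap(low, true, high, true).values()) {
            n += rowIds.size();
        }
        return n;
    }

    public static void main(String[] args) {
        ForecastTemps temps = new ForecastTemps();
        for (double t : new double[] {70, 80, 85, 90, 95, 85}) temps.insert(t);
        System.out.println(temps.countByScan(80, 90) + " " + temps.countByIndex(80, 90)); // 4 4
    }
}
```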


4. Partitioning
       Partitioned Forecast rather than Station. Since Station is small and
       mostly constant, there is not much benefit in partitioning it. Since Forecast is more dynamic and
       keeps growing, there is more benefit in partitioning it.
       Implemented partitioning in both JDBC and Hibernate.
       Number of Forecast partitions = 2
           o   Assumed primaryId to be unique across all the Stations.
           o   Location algorithm: Determines the Forecast partition number using a hash on the Station’s
               primaryId.
       Due to lack of time, we could not collect performance numbers for this design.
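
A location algorithm of this kind can be sketched in a few lines. The report does not state which hash function was used, so the sketch assumes String.hashCode(); the class name PartitionLocator and the table names forecast1 and forecast2 are also assumptions made for illustration.

```java
/** Maps a Station primaryId to one of the Forecast partitions (hypothetical sketch). */
final class PartitionLocator {
    static final int NUM_PARTITIONS = 2; // the report uses 2 partitions

    /** Returns a partition number in [0, numPartitions); floorMod keeps it non-negative. */
    static int partitionFor(String primaryId, int numPartitions) {
        return Math.floorMod(primaryId.hashCode(), numPartitions);
    }

    /** Returns the assumed table name for the partition holding this station's forecasts. */
    static String tableFor(String primaryId) {
        return "forecast" + (partitionFor(primaryId, NUM_PARTITIONS) + 1);
    }

    public static void main(String[] args) {
        // The same primaryId always lands in the same partition.
        System.out.println(tableFor("S1").equals(tableFor("S1"))); // true
    }
}
```

Math.floorMod is used instead of the % operator because hashCode() can be negative, and a negative remainder would not be a valid partition number.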


   Observations


   JDBC:

       Number of partitions can be changed by changing a single variable in JDBCUtil.java
       Uses a foreign key relation from Forecast to Station.
       Implemented using a single Forecast class, but the JDBC query is executed on the
       appropriately selected partition.

   Hibernate:

        Number of partitions can be changed by changing a single variable in HibernateUtil.java
        Uses @ManyToOne relation from Forecast to Station.
        Implemented using an abstract base Forecast class and two child classes, Forecast1 and Forecast2,
        that represent the 2 partitions.
        Initially used the @Inheritance(strategy=InheritanceType.TABLE_PER_CLASS) strategy and
        then moved on to the @MappedSuperclass strategy.



Partitioning served as a good exercise before moving on to sharding.



5. Sharding
        To analyze the performance of the normalized version when sharded into 2 databases
        Implemented sharding in both JDBC and Hibernate.
        Number of shards = 2
            o    Assumed primaryId to be unique across all the Stations.
            o    Location algorithm: Determines the shard number using a hash on the Station’s primaryId.
        The shards are multiple PostgreSQL instances on the same host.
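
For sharding, the hash selects a database instance rather than a table. The sketch below routes a primaryId to a JDBC URL and counts how ids spread over the shards. It is an illustration under assumptions: ShardRouter is a hypothetical class, String.hashCode() is assumed as the hash, and the database name "weather" in the usage is made up; only the ports 5432 and 5433 come from the run instructions in this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Routes a Station primaryId to the JDBC URL of its shard (hypothetical sketch). */
final class ShardRouter {
    private final List<String> jdbcUrls;

    ShardRouter(List<String> jdbcUrls) {
        if (jdbcUrls.isEmpty()) throw new IllegalArgumentException("at least one shard is required");
        this.jdbcUrls = new ArrayList<>(jdbcUrls);
    }

    /** Returns a shard number in [0, number of shards). */
    int shardFor(String primaryId) {
        return Math.floorMod(primaryId.hashCode(), jdbcUrls.size());
    }

    /** Returns the URL of the instance that owns this station and its forecasts. */
    String urlFor(String primaryId) {
        return jdbcUrls.get(shardFor(primaryId));
    }

    /** Counts how many of the given ids fall on each shard, to check uniformity. */
    int[] distribution(Iterable<String> primaryIds) {
        int[] counts = new int[jdbcUrls.size()];
        for (String id : primaryIds) counts[shardFor(id)]++;
        return counts;
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:postgresql://localhost:5432/weather",   // database name is an assumption
                "jdbc:postgresql://localhost:5433/weather"));
        System.out.println(router.urlFor("S1"));
    }
}
```

The returned URL would then be passed to the connection or session factory for that instance, so the Station and Forecast classes themselves stay unaware of sharding.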




    Observations
    JDBC:
    The developer had to think from a database designer's perspective to design and implement
    the sharding methodology.
    Number of shards can be changed by changing a single variable in JDBCUtil.java
    Uses a foreign key relation from Forecast to Station.
    Implemented using a single Station and Forecast class, but the JDBC query is executed using
    the connection to the appropriately selected instance.



    Hibernate:
    We found the implementation process very developer friendly.
    Implemented the ManyToOne strategy for creating relationships. Also, the Station table
    is loaded first and then the Forecast data is entered into the database.
    The relationship is mapped by retrieving the Station object from the database and mapping
    it to the Forecast details while inserting Forecast data. This avoided the file searching and
    mapping time and provided a performance gain.
    Proper data types are used for the columns in the database.
    Number of shards can be changed by changing a single variable in HibernateUtil.java
    Uses @ManyToOne relation from Forecast to Station.
    Implemented using a single Station and Forecast class, but the object is saved using the
    session to the appropriately selected instance.
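
The gain from loading Station first can be sketched as a lookup followed by a single pass over the Forecast rows, instead of re-reading the Forecast file for every Station. The sketch is an assumption-based illustration: ForecastLinker is hypothetical, and the in-memory map stands in for retrieving the already stored Station from the database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Links Forecast rows to previously loaded Stations in one pass (hypothetical sketch). */
final class ForecastLinker {
    /** Step 1: load the stations and give each distinct primaryId a surrogate id (1, 2, ...). */
    static Map<String, Integer> indexStations(List<String> stationPrimaryIds) {
        Map<String, Integer> idByPrimaryId = new HashMap<>();
        for (String primaryId : stationPrimaryIds) {
            idByPrimaryId.putIfAbsent(primaryId, idByPrimaryId.size() + 1);
        }
        return idByPrimaryId;
    }

    /** Step 2: one pass over the forecasts; -1 marks a forecast whose station is unknown. */
    static List<Integer> link(Map<String, Integer> idByPrimaryId, List<String> forecastPrimaryIds) {
        List<Integer> stationIds = new ArrayList<>(forecastPrimaryIds.size());
        for (String primaryId : forecastPrimaryIds) {
            stationIds.add(idByPrimaryId.getOrDefault(primaryId, -1));
        }
        return stationIds;
    }

    public static void main(String[] args) {
        Map<String, Integer> stations = indexStations(Arrays.asList("A", "B"));
        System.out.println(link(stations, Arrays.asList("B", "A", "B", "C"))); // [2, 1, 2, -1]
    }
}
```

The work grows with the number of stations plus the number of forecasts, rather than with their product as in the nested file scan.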
Test cases
Three types of search queries were written to test the query performance of the various
approaches used in this project:


       For a given ‘StationId’: Fetches the related records from the Station and Forecast tables.
       For a given temperature range: Fetches all the related Station and Forecast details.
       For a given Forecast time: Fetches all the related Station and Forecast details.



Performance comparison table:

                                             JDBC        Hibernate
   Denormalization*   Loading (hrs)            10              1.3
                      Query 1 (msecs)         194              986
                      Query 2 (msecs)       10500             4099
                      Query 3 (msecs)         780             1033
   Normalization*     Loading (hrs)             3                6
                      Query 1 (msecs)         113             2638
                      Query 2 (msecs)        7747            16863
                      Query 3 (msecs)         768             1231
   Indexing**         Loading (hrs)             -               11
                      Query 1 (msecs)           -           119000
                      Query 2 (msecs)           -            60000
                      Query 3 (msecs)           -             2000
   Sharding*          Loading (hrs)           2.2             2.45
                      Query 1 (msecs)         127             1471
                      Query 2 (msecs)        6363           490267
                      Query 3 (msecs)         885             2539

*Performed on a MacBook with 8 GB RAM
**Performed on a PC with 4 GB RAM
Performance comparison graphs

[Figure: Design vs. Load Performance (in hrs). Bar chart comparing JDBC and Hibernate for the Denormalized, Normalized, and Sharding designs.]



Denormalized:
   1. In JDBC the records are loaded into the database tuple by tuple. This involves hitting the
       database each time and resulted in a huge performance loss.
   2. In Hibernate the data is loaded in batches. Since there were no constraints, the data loaded
       faster and took less write time.
Normalized:
   1. In JDBC, since the data in the Forecast table is normalized, the load performance
      improved drastically.
   2. In Hibernate, since we tried to load both Station and Forecast data simultaneously, it
      required reading all the records in the Forecast file for each Station. This resulted in high
      file reading time and a performance loss. Also, Hibernate is slower than JDBC
      because of the additional ORM layer.
Sharding:
   1. The load time is lower since the data is distributed, so each database operation works on
      a comparatively smaller set of records.
[Figure: Design vs. Query1 Performance (in msecs). Bar chart comparing JDBC and Hibernate for the Denormalized, Normalized, and Sharding designs.]



Query1: Finding all the details (including Forecasts) related to a Station by passing the Station Id
as a parameter.


Denormalized:
       1. In JDBC the join is faster than the ORM HQL join, since JDBC accesses the
            database directly.


Normalized:
       1. In JDBC the join is faster than ORM lazy fetching, since JDBC accesses the
            database directly.
       2. For a normalized version, the query performance can be improved by eager fetching.


Sharding:
       1. In both JDBC and Hibernate the performance improved, since fewer records
       are involved in searching the data.
[Figure: Design vs. Query2 Performance (in msecs). Bar chart comparing JDBC and Hibernate for the Denormalized, Normalized, and Sharding designs.]




Query2: Fetching all the records that fall within the temperature range 80 to 90.


   1. This query has a high response time due to the large number of records (~2.3 million)
       retrieved from the database.
   2. The JDBC query performs faster in the normalized version due to the foreign key constraint, and
       performance improves further with sharding because less searching is involved.
   3. Hibernate HQL queries are faster in the denormalized version since there is less mapping
       involved, irrespective of query type.
   4. The response time increased for Hibernate sharding since data is fetched from both
       shards and an inner query needs to be performed to get the mapped ids. Also, ORM
       mapping time increases due to the extra sessions and transactions involved.
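
Because the shard is chosen by a hash on the Station's primaryId, a temperature range says nothing about where the matching rows live, so the range query has to visit every shard and merge the partial results. A minimal sketch of that scatter-gather step follows; ScatterGather and its Shard interface are hypothetical, and each Shard stands in for a query run over one JDBC connection or Hibernate session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Runs one range query on every shard and merges the results (hypothetical sketch). */
final class ScatterGather {
    /** One shard's answer to the range query; stands in for a per-instance database query. */
    interface Shard {
        List<Double> temperaturesBetween(double low, double high);
    }

    /** Queries the shards one after another and concatenates the partial results. */
    static List<Double> queryAllShards(List<Shard> shards, double low, double high) {
        List<Double> merged = new ArrayList<>();
        for (Shard shard : shards) {
            merged.addAll(shard.temperaturesBetween(low, high));
        }
        return merged;
    }

    public static void main(String[] args) {
        Shard first = (low, high) -> Arrays.asList(82.0, 88.5);  // canned rows for illustration
        Shard second = (low, high) -> Arrays.asList(90.0);
        System.out.println(queryAllShards(Arrays.asList(first, second), 80, 90)); // [82.0, 88.5, 90.0]
    }
}
```

In this sequential form the total response time is the sum of the per-shard times plus the merge, which is one reason a sharded range query can be slower than the same query on a single database.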
[Figure: Design vs. Query3 Performance (in msecs). Bar chart comparing JDBC and Hibernate for the Denormalized, Normalized, and Sharding designs.]



Query3: Fetching all the records for a particular timestamp.

The reasons for the performance variations in this graph are the same as for Query2. The lower
times are due to the smaller number of records returned compared to Query2.
Lessons Learnt:
      Manually reading from each file takes a lot of time.
      Hibernate requires fewer lines of code than JDBC, but analyzing and using Hibernate is much more complex.
      Hibernate demands a steep learning curve due to its complicated background processes.
      With Hibernate it is easier to adapt the implementation to a database change than with JDBC.
      Future work for this project includes comparing the performance after introducing a second-level cache in Hibernate and caching mechanisms for JDBC.
      Avoid triggers and indexes while loading the data.
      The data should be distributed across partitions or shards as uniformly as possible.
      Hibernate provides a lot of flexibility in writing queries compared to JDBC. Criteria are useful when the query needs to be generated dynamically. Also, the various fetching strategies (eager/lazy fetching) provide the flexibility to frame the selection process.
      JDBC is the older paradigm but easier to debug. Hibernate is the newer paradigm but tough to debug.
      Thinking and working with objects using serialization is better than thinking and working with SQL query strings.
      Using Criteria to perform queries is powerful enough to retrieve related objects. This requires explicit joins or multiple queries in JDBC.
      Using Hibernate, the table structures are controlled directly from the object definitions, which provides better control and flexibility to recreate the tables. Using JDBC, this requires matching the SQL structure with the Java objects/code.
      Sessions are an important concept in Hibernate that improve performance through batching and caching. In JDBC, caching has to be coded explicitly by the developer.
      There is no ‘SaveOrUpdate’ concept in JDBC/SQL queries. It has to be implemented using a stored procedure that performs an ‘update’, else an ‘insert’. Hibernate provides this flexibility.
      JDBC can have update conflicts when two threads simultaneously update a record, unless extra version information is embedded in each query by the developer. Using Hibernate’s @Version it is easy to achieve this behavior.
      Using composite keys is a bit more work in Hibernate, but easy and flexible.
      ResultSet-to-object conversion is explicit in JDBC. Using Hibernate, this happens implicitly.
      Using JPA relationships provides a lot of flexibility in embedding related objects and expressing relations between them.
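
The point above about update conflicts and version information can be sketched as follows. This is an in-memory illustration of the version check, not the project's code: VersionedStore is hypothetical, and its update method mimics a statement of the form UPDATE ... SET value = ?, version = version + 1 WHERE id = ? AND version = ?, which changes no row when another writer has already bumped the version.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory illustration of optimistic locking with a version column (hypothetical). */
final class VersionedStore {
    private static final class Row {
        String value;
        long version; // incremented on every successful update

        Row(String value) { this.value = value; }
    }

    private final Map<String, Row> rows = new HashMap<>();

    synchronized void insert(String id, String value) { rows.put(id, new Row(value)); }

    synchronized long versionOf(String id) { return rows.get(id).version; }

    synchronized String valueOf(String id) { return rows.get(id).value; }

    /** Applies the update only if the caller read the current version; returns false on a stale read. */
    synchronized boolean update(String id, String newValue, long expectedVersion) {
        Row row = rows.get(id);
        if (row == null || row.version != expectedVersion) return false;
        row.value = newValue;
        row.version++;
        return true;
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.insert("S1", "initial");
        long seen = store.versionOf("S1");                         // both writers read version 0
        System.out.println(store.update("S1", "writer A", seen));  // true
        System.out.println(store.update("S1", "writer B", seen));  // false: version moved on
    }
}
```

The second writer has to re-read the row and retry, so the first writer's change is not silently overwritten.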
Instructions to run program

1. mesowest.sh is used to download files from the internet.
           Open it with an editor and change the variable ‘arch’ to the path where you want to store the
           downloaded data.
2. For all the project folders make the following changes to DirectoryRead.java file:
           Change the path of variable homePath to the path where all downloaded folders are kept.
           Optionally, this path can be provided using the 1st command-line argument.
           Change the path of variable archivePath to the path where you want to move the data after
           reading from file. Optionally, this path can be provided using the 2nd command-line argument.
3. For all hibernate projects make the following changes to hibernate.cfg.xml file:
            Edit the configuration details.
            Optionally, an alternate configuration file can be provided using the 3rd command-line
            argument.
4. For all JDBC related projects make the following changes to JDBCUtil.java file:
            Change the database connection details.
            By default, 1st instance is at localhost:5432 and 2nd instance is at localhost:5433.
5. For all JDBC related projects run .sql files to create the database schema.
6. In AutomatedBaseVersion, we automated downloading the data directly
    from the website and loading it into the database.


Contribution

   Manjula Kollipara
        o JDBC Normalization, JDBC Partitioning, JDBC Horizontal Sharding
        o Hibernate Partitioning, Hibernate Horizontal Sharding
        o Jar & Ant builds, Database design
   Roopa Penmetsa
        o Hibernate indexing version
        o Test cases for all Hibernate versions
        o Ant builds
   Sumalatha Elliadka
        o Hibernate base version, Hibernate normalization
        o JDBC JUnit library functions
        o Test scripts
   Sridhar Srigiriraju
        o Created shell scripts to automate the data download process
        o Database design



More Related Content

What's hot

123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informaticahomeworkping9
 
The Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth StudyThe Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth StudyKevin Tong
 
Row level security in enterprise applications
Row level security in enterprise applicationsRow level security in enterprise applications
Row level security in enterprise applicationsAlexander Tokarev
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 
PyCon DE 2013 - Table Partitioning with Django
PyCon DE 2013 - Table Partitioning with DjangoPyCon DE 2013 - Table Partitioning with Django
PyCon DE 2013 - Table Partitioning with DjangoMax Tepkeev
 
SQL Optimization With Trace Data And Dbms Xplan V6
SQL Optimization With Trace Data And Dbms Xplan V6SQL Optimization With Trace Data And Dbms Xplan V6
SQL Optimization With Trace Data And Dbms Xplan V6Mahesh Vallampati
 
An application classification guided cache tuning heuristic for
An application classification guided cache tuning heuristic forAn application classification guided cache tuning heuristic for
An application classification guided cache tuning heuristic forKhyati Rajput
 
Cost Based Optimizer - Part 1 of 2
Cost Based Optimizer - Part 1 of 2Cost Based Optimizer - Part 1 of 2
Cost Based Optimizer - Part 1 of 2Mahesh Vallampati
 
Implementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeImplementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeAlexander Decker
 
JPA and Coherence with TopLink Grid
JPA and Coherence with TopLink GridJPA and Coherence with TopLink Grid
JPA and Coherence with TopLink GridJames Bayer
 
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...Vijay Prime
 
Scalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehousesScalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehousesIRJET Journal
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET Journal
 
Cost Based Optimizer - Part 2 of 2
Cost Based Optimizer - Part 2 of 2Cost Based Optimizer - Part 2 of 2
Cost Based Optimizer - Part 2 of 2Mahesh Vallampati
 
Row Level Security in databases advanced edition
Row Level Security in databases advanced editionRow Level Security in databases advanced edition
Row Level Security in databases advanced editionAlexander Tokarev
 
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONA BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONcscpconf
 
IRJET- Enhanced Density Based Method for Clustering Data Stream
IRJET-  	  Enhanced Density Based Method for Clustering Data StreamIRJET-  	  Enhanced Density Based Method for Clustering Data Stream
IRJET- Enhanced Density Based Method for Clustering Data StreamIRJET Journal
 
In Defense Of Core Data
In Defense Of Core DataIn Defense Of Core Data
In Defense Of Core DataDonny Wals
 

What's hot (20)

123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informatica
 
The Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth StudyThe Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth Study
 
Row level security in enterprise applications
Row level security in enterprise applicationsRow level security in enterprise applications
Row level security in enterprise applications
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
PyCon DE 2013 - Table Partitioning with Django
PyCon DE 2013 - Table Partitioning with DjangoPyCon DE 2013 - Table Partitioning with Django
PyCon DE 2013 - Table Partitioning with Django
 
SQL Optimization With Trace Data And Dbms Xplan V6
SQL Optimization With Trace Data And Dbms Xplan V6SQL Optimization With Trace Data And Dbms Xplan V6
SQL Optimization With Trace Data And Dbms Xplan V6
 
An application classification guided cache tuning heuristic for
An application classification guided cache tuning heuristic forAn application classification guided cache tuning heuristic for
An application classification guided cache tuning heuristic for
 
Cost Based Optimizer - Part 1 of 2
Cost Based Optimizer - Part 1 of 2Cost Based Optimizer - Part 1 of 2
Cost Based Optimizer - Part 1 of 2
 
Implementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeImplementation of query optimization for reducing run time
Implementation of query optimization for reducing run time
 
11i Logs
11i Logs11i Logs
11i Logs
 
JPA and Coherence with TopLink Grid
JPA and Coherence with TopLink GridJPA and Coherence with TopLink Grid
JPA and Coherence with TopLink Grid
 
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
 
C0312023
C0312023C0312023
C0312023
 
Scalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehousesScalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehouses
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
 
Cost Based Optimizer - Part 2 of 2
Cost Based Optimizer - Part 2 of 2Cost Based Optimizer - Part 2 of 2
Cost Based Optimizer - Part 2 of 2
 
Row Level Security in databases advanced edition
Row Level Security in databases advanced editionRow Level Security in databases advanced edition
Row Level Security in databases advanced edition
 
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONA BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
 
IRJET- Enhanced Density Based Method for Clustering Data Stream
IRJET-  	  Enhanced Density Based Method for Clustering Data StreamIRJET-  	  Enhanced Density Based Method for Clustering Data Stream
IRJET- Enhanced Density Based Method for Clustering Data Stream
 
In Defense Of Core Data
In Defense Of Core DataIn Defense Of Core Data
In Defense Of Core Data
 

Similar to 226 team project-report-manjula kollipara

Climbing the beanstalk
Climbing the beanstalkClimbing the beanstalk
Climbing the beanstalkgordonyorke
 
W-JAX Performance Workshop - Database Performance
W-JAX Performance Workshop - Database PerformanceW-JAX Performance Workshop - Database Performance
W-JAX Performance Workshop - Database PerformanceAlois Reitbauer
 
Mow2012 data services
Mow2012 data servicesMow2012 data services
Mow2012 data servicesSyed Shaaf
 
Realize better value and performance migrating from Azure Database for Postgr...
Realize better value and performance migrating from Azure Database for Postgr...Realize better value and performance migrating from Azure Database for Postgr...
Realize better value and performance migrating from Azure Database for Postgr...Principled Technologies
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODINagendra K
 
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdfMukundThakur22
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...CitiusTech
 

226 team project-report-manjula kollipara

  • 1. Project I – BigData
    CMPE 226 – Database Design
    Submitted by: Manjula Kollipara, Roopa Penmetsa, Sumalatha Elliadka, Sridhar Srigiriraju
    Project Advisor: Prof. John Gash
    October, 2012
  • 2. Abstract
    The goal of this project is to understand the ORM and JDBC methodologies and to explore design considerations for managing big data.

    Introduction
    Our project handles huge data sets and measures performance. The design considerations are implemented using both ORM and JDBC frameworks. To make the performance results comparable, we tried to keep the implementation design and resources similar.
    The data set for this project contains weather details downloaded from the internet. The data has three files, but we decided to ignore one of them because it is a subset of another file. From here on, the main files are referred to as Station and Forecast. Although the Station data is mostly constant, the Forecast data grows dynamically.

    Implementation
    In this project, we downloaded the weather data from the internet using a shell script and collected data for 4 days. Approximately 2 GB of data was collected and loaded into the database using file-reading mechanisms. The following table provides a general overview of the implementation details.

    Implementation Details
    Files                DB table names
    mesowest_csv.tbl     Station
    mesowest.out         Forecast

    Number of Tuples
    Station              ~3.4 M
    Forecast             ~3.5 M

    System Set-up
    Processor            2.9 GHz Intel Core i7
    Memory               8 GB
  • 3. Tools & Technology
    IDE                      Eclipse
    Tools                    ORM, JDBC
    Programming languages    Java, XML, SQL
    Testing strategies       Unit test cases, HQL, Criteria

    Performance checks
    o Loading data: The script downloads the data into separate folders named with the corresponding timestamp. All the projects load data from these folders and store it in the database. This lets us compare the load time of all the designs.
    o Insert/update/retrieve operations on data: These operations occur while loading the data and can be significantly affected by database design aspects such as relationships and constraints.
    o Search or find operations on data: The test queries focus mainly on the speed of fetching data based on a few search criteria.
    o SQL vs HQL: By implementing the test-case queries in SQL and HQL, we tried to compare performance based on the query language specific to JDBC and Hibernate.

    Design/Approaches considered:
    1. Denormalized data
    o Goal: to compare the load performance.
    o Performed basic clean-up: removed redundant and invalid data.
    o Station and Forecast data is simply loaded from the folders into the database.
  • 4. o No relationships in this design.
    o The Station and Forecast tables do not hold constraints or integrity rules on any columns.

    Observations
    JDBC
    o PostgreSQL provides a COPY command (usable through its JDBC driver) that allows the user to bulk-load data from a file into the database.
    o If the data does not conform to the integrity rules, the COPY command fails and throws an exception.
    o In this project, the base version inserts tuple by tuple into the database. This method requires hitting the database every time an insert is performed.
    Hibernate
    o ORM allows traversing the data object by object.
    o The developer can control the number of hits to the database (batch updates are possible).
    o It does not provide an option to load all ~3.x M records into the database at once.
    o In this project, data is loaded in batches of 100 records at a time.

    2. Normalized data
    o Goal: to analyze the effects of normalization on huge data.
    o Removed the redundant MNET, SLAT, SLON, and SELV columns from the Forecast table.
    o Assumption: the Station primaryId is unique across all the Stations.
  • 5. Observations
    JDBC:
    o Introduced StationId to avoid duplicate Station information in the database.
    o Normalized the Forecast table because it is dynamic.
    o Implemented a foreign key in Forecast referring to Station.
    o For each Forecast entry, the corresponding 'StationId' has to be looked up from Station.
    o Implemented a stored procedure for the 'SaveOrUpdate' strategy. It also ensures that referential constraints are not violated.
    o Implementation is comparatively simple.
    Hibernate:
    o Implemented OneToMany between the Station and Forecast tables.
    o The data is loaded into both tables simultaneously by associating the Station object with the Forecast object at runtime. This approach involved a lot of file-searching time and resulted in performance degradation.
    o Although all the fields are stored as strings, a better approach would be to use appropriate data types for double and date fields.
    o The developer had control over database access.
    o Implementation is complex, since Hibernate does a lot of work in the background. For example, extra logic needs to be implemented to map a detached object to the current session.
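The 'SaveOrUpdate' strategy mentioned in the JDBC observations (update the row if its key already exists, otherwise insert it) can be sketched as follows. This is only an illustration of the logic, not the project's stored procedure, which the report does not show. The class `StationStore`, its in-memory map, and the sample ids are hypothetical stand-ins for the Station table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the Station table, keyed by primaryId.
class StationStore {
    private final Map<String, String> rows = new HashMap<>();

    // Update the row if the key already exists, otherwise insert it.
    // Returns true when a new row was inserted, false when an existing row was updated.
    boolean saveOrUpdate(String primaryId, String details) {
        boolean isNew = !rows.containsKey(primaryId);
        rows.put(primaryId, details);
        return isNew;
    }

    String get(String primaryId) {
        return rows.get(primaryId);
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        StationStore store = new StationStore();
        System.out.println(store.saveOrUpdate("ST1", "first load"));  // inserted
        System.out.println(store.saveOrUpdate("ST1", "second load")); // updated
        System.out.println(store.size());
    }
}
```

Because the key is checked before writing, reloading the same Station file does not create duplicate rows, which is the property the normalized design relies on.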
  • 6. 3. Indexing
    o Goal: to analyze the search query performance in the test cases.
    o Goal: to analyze the load performance when columns are indexed.
    o Indexed the columns used frequently in the test cases. In this project we implemented indexing on the temperature field of the Forecast table.
    o Indexes created implicitly by Postgres on primary keys are not counted toward the performance difference, since they also exist in the non-indexed version.
    o This method is implemented only for the Hibernate version.
    Observation
    o Indexing resulted in extra loading time, but search queries were faster. However, we tested this design on a separate system, so the results are not directly comparable with the other designs.

    4. Partitioning
    o Partitioned Forecast rather than Station. Since Station is small and mostly constant, there is not much benefit in partitioning it. Since Forecast is more dynamic and keeps growing, there is more benefit in partitioning it.
    o Implemented partitioning in both JDBC and Hibernate.
    o Number of Forecast partitions = 2
      o Assumed primaryId to be unique across all the Stations.
      o Location algorithm: determines the Forecast partition number using a hash on the Station's primaryId.
    o Due to lack of time, we could not collect performance numbers for this design.
    Observations
    JDBC:
    o The number of partitions can be changed by changing a single variable in JDBCUtil.java.
    o Uses a foreign key relation from Forecast to Station.
    o Implemented using a single Forecast class, but the JDBC query is executed on the appropriately selected partition.
  • 7. Hibernate:
    o The number of partitions can be changed by changing a single variable in HibernateUtil.java.
    o Uses a @ManyToOne relation from Forecast to Station.
    o Implemented using an abstract base Forecast class and two child classes, Forecast1 and Forecast2, that represent the 2 partitions.
    o Initially used the @Inheritance(strategy=InheritanceType.TABLE_PER_CLASS) strategy and then moved on to the @MappedSuperclass strategy.
    o Partitioning served as a good exercise before moving on to sharding.

    5. Sharding
    o Goal: to analyze the performance of the normalized version when sharded into 2 databases.
    o Implemented sharding in both JDBC and Hibernate.
    o Number of shards = 2
      o Assumed primaryId to be unique across all the Stations.
      o Location algorithm: determines the shard number using a hash on the Station's primaryId.
    o Multiple Postgres instances run on the same host.
    Observations
  • 8. JDBC:
    o The developer had to think from a database designer's perspective to design and implement the sharding methodology.
    o The number of shards can be changed by changing a single variable in JDBCUtil.java.
    o Uses a foreign key relation from Forecast to Station.
    o Implemented using a single Station class and a single Forecast class, but the JDBC query is executed using the connection to the appropriately selected instance.
    Hibernate:
    o We found the implementation process very developer friendly.
    o Implemented the ManyToOne strategy for creating relationships. The Station table is loaded first, and then the Forecast data is entered into the database. The relationship is mapped by retrieving the Station object from the database and associating it with the Forecast details while inserting Forecast data. This avoided the file searching and mapping time and provided a performance gain.
    o Proper data types are used for the columns in the database.
    o The number of shards can be changed by changing a single variable in HibernateUtil.java.
    o Uses a @ManyToOne relation from Forecast to Station.
    o Implemented using a single Station class and a single Forecast class, but the object is saved using the session for the appropriately selected instance.
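The location algorithm used for both partitioning and sharding (a hash on the Station's primaryId selects the target) can be sketched as below. The report does not say which hash function was used, so `String.hashCode` is an assumption here, as are the class name `ShardRouter` and the database name `weather`. The two ports match the default instances listed in the run instructions (localhost:5432 and localhost:5433).

```java
// Hypothetical router: maps a Station primaryId to one of N shards (or partitions).
class ShardRouter {
    private final String[] jdbcUrls;

    ShardRouter(String... jdbcUrls) {
        if (jdbcUrls.length == 0) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.jdbcUrls = jdbcUrls;
    }

    // Assumed hash: String.hashCode. floorMod keeps the result non-negative
    // even when hashCode() is negative.
    int shardFor(String primaryId) {
        return Math.floorMod(primaryId.hashCode(), jdbcUrls.length);
    }

    String urlFor(String primaryId) {
        return jdbcUrls[shardFor(primaryId)];
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(
                "jdbc:postgresql://localhost:5432/weather",  // 1st instance
                "jdbc:postgresql://localhost:5433/weather"); // 2nd instance
        System.out.println(router.urlFor("A"));
        System.out.println(router.urlFor("B"));
    }
}
```

Since every Forecast row is routed by its Station's primaryId, a Station and all of its Forecasts land on the same shard, so the foreign key from Forecast to Station never has to cross instances. Changing the number of shards only changes the length of the URL array, which mirrors the single-variable change described above.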
  • 9. Test cases
    Three types of search queries were written to test the query performance of the various approaches used in this project:
    1. For a given 'Stationid': fetches the related records from the Station and Forecast tables.
    2. For a given temperature range: fetches all the related Station and Forecast details.
    3. For a given Forecast time: fetches all the related Station and Forecast details.

    Performance comparison table
    Design              Measurement             JDBC      Hibernate
    Denormalization*    Loading (hrs)           10        1.3
                        Test case 1 (msecs)     194       986
                        Test case 2 (msecs)     10500     4099
                        Test case 3 (msecs)     780       1033
    Normalization*      Loading (hrs)           3         6
                        Test case 1 (msecs)     113       2638
                        Test case 2 (msecs)     7747      16863
                        Test case 3 (msecs)     768       1231
    Indexing**          Loading (hrs)           -         11
                        Test case 1 (msecs)     -         119000
                        Test case 2 (msecs)     -         60000
                        Test case 3 (msecs)     -         2000
    Sharding*           Loading (hrs)           2.2       2.45
                        Test case 1 (msecs)     127       1471
                        Test case 2 (msecs)     6363      490267
                        Test case 3 (msecs)     885       2539
    *Performed on a MacBook with 8 GB RAM
    **Performed on a PC with 4 GB RAM
  • 10. Performance comparison graphs
    [Figure: Design vs. Load Performance (in hrs); series: JDBC, Hibernate; designs: Denormalized, Normalized, Sharding]
    Denormalized:
    1. In JDBC the records are loaded into the database tuple by tuple. This involves hitting the database each time and resulted in a huge performance loss.
    2. In Hibernate the data is loaded in batches. Since there were no constraints, the data loaded faster and took less write time.
    Normalized:
    1. In JDBC, since the data in the Forecast table is normalized, the load performance improved drastically.
    2. In Hibernate, since we tried to load both Station and Forecast data simultaneously, it required reading all the records in the Forecast file for each Station. This resulted in high file-reading time and a performance loss. Hibernate is also slower than JDBC because of the additional ORM layer.
    Sharding:
    1. The load is faster since the data is distributed, so each database operation works on a comparatively smaller set of records.
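The gap between tuple-by-tuple loading and batched loading comes down to how many round trips reach the database. A minimal sketch of the batching logic, assuming the batch size of 100 used in the Hibernate version: the class `BatchLoader` and its `sink` callback are hypothetical, with the callback standing in for whatever actually writes a batch (for example a Hibernate session flush, or `PreparedStatement.executeBatch()` in JDBC).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers records and hands them to the sink one batch at a time.
class BatchLoader<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // stand-in for one database round trip
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    BatchLoader(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffer the record and flush automatically once the batch is full.
    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered; call once more at the end for the last partial batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
        flushes++;
    }

    int flushes() {
        return flushes;
    }

    public static void main(String[] args) {
        BatchLoader<Integer> loader = new BatchLoader<>(100, batch -> { /* write batch */ });
        for (int i = 0; i < 250; i++) {
            loader.add(i);
        }
        loader.flush();
        System.out.println("round trips: " + loader.flushes());
    }
}
```

With a batch size of 1 this degenerates into the tuple-by-tuple behaviour of the JDBC base version: one round trip per record.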
  • 11. [Figure: Design vs. Query1 Performance (in msecs); series: JDBC, Hibernate; designs: Denormalized, Normalized, Sharding]
    Query1: Finding all the details (including Forecasts) related to a Station by passing the Station Id as a parameter.
    Denormalized:
    1. The JDBC join is faster than the ORM HQL join, since JDBC accesses the database directly.
    Normalized:
    1. The JDBC join is faster than ORM lazy fetching, since JDBC accesses the database directly.
    2. For the normalized version, the query performance can be improved by eager fetching.
    Sharding:
    1. In both JDBC and Hibernate the performance improved, since fewer records are involved in searching the data.
  • 12. [Figure: Design vs. Query2 Performance (in msecs); series: JDBC, Hibernate; designs: Denormalized, Normalized, Sharding]
    Query2: Fetching all the records that fall within the temperature range 80 to 90.
    1. This query has a high response time because of the large number of records (~2.3 million) retrieved from the database.
    2. The JDBC query performs faster in the normalized version due to the foreign key constraint. Performance improves further with sharding because less searching is involved.
    3. Hibernate HQL queries are faster in the denormalized version, since there is less mapping involved, irrespective of query type.
    4. The response time increased for Hibernate sharding, since data is fetched from both shards and an inner query needs to be performed to get the mapped ids. ORM mapping time also increases due to the extra sessions and transactions involved.
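Point 4 above describes a scatter-gather pattern: a range query that is not keyed on the Station's primaryId cannot be routed to a single shard, so it has to run on every shard and the results have to be merged. The sketch below illustrates that pattern with in-memory lists standing in for the two shards. The names `ForecastRow` and `ShardedQuery` are hypothetical, and treating the range bounds as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified Forecast record.
class ForecastRow {
    final String stationId;
    final double temperature;

    ForecastRow(String stationId, double temperature) {
        this.stationId = stationId;
        this.temperature = temperature;
    }
}

class ShardedQuery {
    // Runs the same range filter on every shard and concatenates the matches.
    // Bounds are treated as inclusive (an assumption).
    static List<ForecastRow> temperatureBetween(List<List<ForecastRow>> shards,
                                                double low, double high) {
        List<ForecastRow> merged = new ArrayList<>();
        for (List<ForecastRow> shard : shards) {
            for (ForecastRow row : shard) {
                if (row.temperature >= low && row.temperature <= high) {
                    merged.add(row);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<ForecastRow> shard0 = Arrays.asList(new ForecastRow("S1", 75), new ForecastRow("S1", 85));
        List<ForecastRow> shard1 = Arrays.asList(new ForecastRow("S2", 90), new ForecastRow("S2", 95));
        List<ForecastRow> hits = temperatureBetween(Arrays.asList(shard0, shard1), 80, 90);
        System.out.println(hits.size() + " matching rows");
    }
}
```

The cost of the merge grows with the number of matching rows on every shard, which is consistent with the observation that a query returning millions of rows benefits little from sharding by Station.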
  • 13. [Figure: Design vs. Query3 Performance (in msecs); series: JDBC, Hibernate; designs: Denormalized, Normalized, Sharding]
    Query3: Fetching all the records for a particular timestamp.
    The reasons for the performance variations in this graph are the same as for Query2. The decrease in time is due to the smaller number of records returned compared to Query2.
  • 14. Lessons Learnt:
    o Manually reading from each file takes a lot of time.
    o Hibernate requires fewer lines of code than JDBC, but analyzing and using Hibernate is much more complex.
    o Hibernate demands a steep learning curve due to its complicated background processes.
    o When the database changes, the change is easier to implement with Hibernate than with JDBC.
    o Future work for this project includes comparing performance after introducing a second-level cache in Hibernate and caching mechanisms in JDBC.
    o Avoid triggers and indexes while loading the data.
    o The partitions or shards should be distributed as uniformly as possible.
    o Hibernate provides a lot of flexibility in writing queries compared to JDBC. Criteria are useful when the query needs to be generated dynamically. The various fetching strategies (eager/lazy fetching) also provide flexibility in framing the selection process.
    o JDBC is the older paradigm but easier to debug. Hibernate is the newer paradigm but tough to debug.
    o Thinking and working with objects using serialization is better than thinking and working with SQL query strings.
    o Using Criteria to perform queries is powerful enough to retrieve related objects. This required explicit joins or multiple queries in JDBC.
    o With Hibernate, the table structures were controlled directly from the object definitions, which provided better control and flexibility to recreate the tables. With JDBC this requires matching the SQL structure with the Java objects/code.
    o Sessions are an important concept in Hibernate that improves performance through batching and caching. In JDBC, caching has to be coded explicitly by the developer.
    o There is no 'SaveOrUpdate' concept in JDBC/SQL queries. It has to be implemented using a stored procedure that performs an 'update', else an 'insert'. Hibernate provides this flexibility.
    o JDBC can have update conflicts when two threads simultaneously update a record, unless the developer embeds extra version information in each query. With Hibernate's @Version it is easy to achieve this behavior.
    o Using composite keys is a bit more work in Hibernate, but easy and flexible.
    o JDBC requires explicit ResultSet-to-object conversion. With Hibernate, this happens implicitly.
    o Using JPA relationships provides a lot of flexibility in embedding related objects and expressing relations between them.
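The "extra version information" mentioned for JDBC is the optimistic-locking check that @Version automates in Hibernate: an update only succeeds if the row still carries the version the writer originally read. The sketch below shows that check with an in-memory map. The class `VersionedStore` and the sample values are hypothetical. In SQL the same test would typically sit in the WHERE clause of the UPDATE (for example `... WHERE id = ? AND version = ?`).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory table whose rows carry a version counter.
class VersionedStore {
    static final class Row {
        final String value;
        final int version;

        Row(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Row> rows = new HashMap<>();

    synchronized void insert(String id, String value) {
        rows.put(id, new Row(value, 0));
    }

    synchronized Row read(String id) {
        return rows.get(id);
    }

    // Succeeds only if nobody has updated the row since expectedVersion was read.
    synchronized boolean update(String id, String newValue, int expectedVersion) {
        Row current = rows.get(id);
        if (current == null || current.version != expectedVersion) {
            return false; // stale write: the caller should re-read and retry
        }
        rows.put(id, new Row(newValue, expectedVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.insert("s1", "70");
        int seenByA = store.read("s1").version;
        int seenByB = store.read("s1").version;
        System.out.println(store.update("s1", "72", seenByA)); // first writer wins
        System.out.println(store.update("s1", "75", seenByB)); // second writer is rejected
    }
}
```

The second writer's update is rejected rather than silently overwriting the first, which is the conflict the lesson above refers to.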
  • 15. Instructions to run program
    1. mesowest.sh is used to download files from the internet. Open it with an editor and change the variable 'arch' to the path where you want to store the downloaded data.
    2. For all the project folders, make the following changes to the DirectoryRead.java file:
       o Change the variable homePath to the path where all the downloaded folders are kept. Optionally, this path can be provided as the 1st command-line argument.
       o Change the variable archivePath to the path where you want to move the data after it has been read. Optionally, this path can be provided as the 2nd command-line argument.
    3. For all Hibernate projects, make the following changes to the hibernate.cfg.xml file:
       o Edit the configuration details. Optionally, an alternate configuration file can be provided as the 3rd command-line argument.
    4. For all JDBC projects, make the following changes to the JDBCUtil.java file:
       o Change the database connection details. By default, the 1st instance is at localhost:5432 and the 2nd instance is at localhost:5433.
    5. For all JDBC projects, run the .sql files to create the database schema.
    6. In AutomatedBaseVersion, we automated downloading the data directly from the website and loading it into the database.

    Contribution
    Manjula Kollipara
    o JDBC Normalization, JDBC Partitioning, JDBC Horizontal Sharding
    o Hibernate Partitioning, Hibernate Horizontal Sharding
    o Jar & Ant builds, Database design
    Roopa Penmetsa
    o Hibernate indexing version
    o Test cases for all Hibernate versions
    o Ant builds
    Sumalatha Elliadka
    o Hibernate base version, Hibernate normalization
  • 16. o JDBC JUnit library functions
    o Test scripts
    Sridhar Srigiriraju
    o Created shell scripts to automate the data download process
    o Database design

    References
    1. http://www.mkyong.com/hibernate/
    2. http://docs.jboss.org/hibernate/orm/4.0/devguide/en-US/html_single/
    3. https://forum.hibernate.org/viewtopic.php?f=1&t=966223