Project I – BigData
CMPE 226 – Database Design
Submitted by
Manjula Kollipara
Roopa Penmetsa
Sumalatha Elliadka
Sridhar Srigiriraju
Project Advisor
Prof. John Gash
October, 2012
Abstract
The goal of this project is to understand the ORM and JDBC methodologies and to explore
design considerations for managing big data.
Introduction
Our project handles huge data sets and measures their performance. The design
considerations are implemented using both the ORM and JDBC frameworks. To compare the
performance results fairly, we kept the implementation design and resources as similar as
possible.
The data set for this project contains weather details downloaded from the internet. The
data comes in three files, but we decided to ignore one of them because it is a subset of
another file. From here on, the two main files are referred to as Station and Forecast.
While the Station data is mostly constant, the Forecast data grows dynamically.
Implementation
In this project, we downloaded the weather data from the internet using a shell script and
collected data for 4 days. Approximately 2 GB of data was collected and loaded into the
database using file-reading mechanisms. The following table provides a general overview of
the implementation details.
Implementation Details
  File              DB Table
  mesowest_csv.tbl  Station
  mesowest.out      Forecast

Number of Tuples
  Station   ~3.4 M
  Forecast  ~3.5 M

System Set-up
  Processor  2.9 GHz Intel Core i7
  Memory     8 GB
Tools & Technology
  IDE                    Eclipse
  Tools                  ORM (Hibernate), JDBC
  Programming Languages  Java, XML, SQL
  Testing Strategies     Unit test cases, HQL, Criteria
Performance checks
Loading data:
The script downloads the data into separate folders with corresponding time stamps.
All the projects load data from these folders and store it in the database. This allowed
us to compare the load times of all the designs.
Insert/update/retrieve operations on data:
These operations are encountered while loading the data and can be significantly affected
by database design aspects such as relationships and constraints.
Search or find operation on data:
The test queries focus mainly on the speed of fetching data based on a few search criteria.
SQL vs HQL:
By implementing the test-case queries in both SQL and HQL, we compared performance across
the query languages specific to JDBC and Hibernate.
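As a hedged illustration (the report does not list its exact query strings), the first
test query — fetching a Station and its Forecasts by station id — might look like this in
each dialect; the table, entity, and column names here are assumptions, not the project's
actual schema:

```java
public class Query1Strings {
    // JDBC/SQL: the join columns must be named explicitly.
    public static final String SQL =
        "SELECT s.*, f.* FROM station s "
      + "JOIN forecast f ON f.station_id = s.station_id "
      + "WHERE s.station_id = ?";

    // Hibernate/HQL: the query navigates the mapped association instead,
    // so no join condition is spelled out.
    public static final String HQL =
        "from Station s join fetch s.forecasts where s.stationId = :stationId";
}
```

The HQL form relies on the OneToMany mapping between Station and Forecast, which is why
the two frameworks can show different performance for the same logical query.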
Designs/Approaches considered:
1. Denormalized Data
To compare the load performance
Performed basic clean-up: removed redundant and invalid data
Station and Forecast data is loaded directly from the folders into the database
No relationships in this design
Station and Forecast tables do not enforce constraints or integrity rules on any columns
Observations
JDBC
Provides the COPY command, which lets the user bulk-load data from a file into the database.
If the data violates any adopted integrity rule, the COPY command aborts and throws an
exception.
In this project, the base version inserts tuples one by one into the database.
This method requires hitting the database every time an insert is performed.
Hibernate
ORM allows traversing the data object by object.
The developer can control the number of hits to the database (e.g., can perform batch updates).
Does not provide an option to load all 3.X M records into the database at once.
In this project, data is loaded in batches of 100 records at a time.
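The batch-loading loop described above can be sketched as follows; the save/flush/clear
calls appear only as comments standing in for the real Hibernate session calls, so the
sketch just counts how often the database would be hit:

```java
public class BatchLoader {
    /**
     * Simulates loading 'total' records in batches of 'batchSize', the way the
     * Hibernate version loads 100 records at a time. Returns the number of
     * flushes (database round-trips) performed.
     */
    public static int loadInBatches(int total, int batchSize) {
        int flushes = 0;
        for (int i = 1; i <= total; i++) {
            // session.save(record) would go here
            if (i % batchSize == 0) {
                // session.flush(); session.clear();
                flushes++;
            }
        }
        if (total % batchSize != 0) {
            flushes++; // flush the final partial batch
        }
        return flushes;
    }
}
```

Loading millions of records in batches of 100 thus trades millions of individual database
hits for tens of thousands of flushes.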
2. Normalized data
To analyze the effects of normalization on huge data
Removed the redundant MNET, SLAT, SLON, SELV columns from the Forecast table
Assumption: Station primaryId is unique across all the Stations.
Observations
JDBC:
Introduced StationId to avoid duplicate Station information in the database.
Normalized the Forecast table because it is dynamic.
Implemented a foreign key in Forecast referring to Station. For each Forecast entry, the
corresponding StationId has to be looked up from Station.
Implemented a stored procedure to realize the 'SaveOrUpdate' strategy. It also
ensures referential constraints are not violated.
Implementation is comparatively simple.
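A minimal sketch of such a 'SaveOrUpdate' stored procedure in PostgreSQL's PL/pgSQL
(update the row if it exists, otherwise insert it); the table and column names are
illustrative, not the project's actual schema:

```sql
CREATE OR REPLACE FUNCTION save_or_update_station(p_id INTEGER, p_name TEXT)
RETURNS VOID AS $$
BEGIN
    -- Try the update first; FOUND reports whether any row matched.
    UPDATE station SET name = p_name WHERE station_id = p_id;
    IF NOT FOUND THEN
        INSERT INTO station (station_id, name) VALUES (p_id, p_name);
    END IF;
END;
$$ LANGUAGE plpgsql;
```

Because the update and insert run inside one function call, the referential constraints
on Forecast rows referring to this Station cannot be violated mid-operation.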
Hibernate:
Implemented a OneToMany relationship between the Station and Forecast tables. The data is
loaded into both tables simultaneously by associating the Station object with the
Forecast object at runtime. This approach involved a lot of file-searching time and
resulted in performance degradation.
Although all the fields are stored as strings, a better approach would be to use
appropriate data types for the double and date fields.
The developer had control over database accessibility.
Implementation is complex, since Hibernate has a lot of hidden/background behavior. For
example, extra logic needs to be implemented to map a detached object to the current
session.
3. Indexing
To analyze the search query performance in test cases
To analyze the load performance when columns are indexed
Indexed the columns frequently used in the test cases. In this project we implemented
indexing on the temperature field of the Forecast table.
Indexes implicitly created on primary keys by Postgres are not considered in the
performance comparison, since they exist in the non-indexed version too.
This method is implemented only for the Hibernate version.
Observation
Indexing added extra loading time, but search queries were faster. However, we tested this
design on a separate system, so the results are not directly comparable with the others.
4. Partitioning
Partitioned the Forecast table rather than Station. Since Station is small and mostly
constant, there is little benefit in partitioning it; since Forecast is more dynamic and
keeps growing, it benefits more from partitioning.
Implemented partitioning in both JDBC and Hibernate.
Number of Forecast Partitions = 2
o Assumed primaryId to be unique across all the Stations.
o Location algorithm: determines the Forecast partition # using a hash on the Station's
primaryId.
Due to lack of time, we could not gather the performance test numbers for this design.
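The location algorithm above can be sketched as a one-line hash, assuming primaryId is a
string; the class and method names are illustrative:

```java
public class PartitionLocator {
    /**
     * Maps a Station's primaryId to one of numPartitions Forecast partitions.
     * Math.floorMod keeps the result non-negative even when hashCode() is
     * negative, so the partition number is always in [0, numPartitions).
     */
    public static int partitionFor(String primaryId, int numPartitions) {
        return Math.floorMod(primaryId.hashCode(), numPartitions);
    }
}
```

Every Forecast row for the same station hashes to the same partition, so a lookup by
primaryId touches exactly one partition.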
Observations
JDBC:
The number of partitions can be changed via a single variable in JDBCUtil.java.
Uses a foreign-key relation from Forecast to Station.
Implemented using a single Forecast class, but the JDBC query is executed on the
appropriately selected partition.
Hibernate:
The number of partitions can be changed via a single variable in HibernateUtil.java.
Uses a @ManyToOne relation from Forecast to Station.
Implemented using an abstract base Forecast class and two child classes, Forecast1 and
Forecast2, that represent the two partitions.
Initially used the @Inheritance(strategy=InheritanceType.TABLE_PER_CLASS) strategy and
then moved on to the @MappedSuperclass strategy.
Partitioning served as a good exercise before moving on to sharding.
5. Sharding
To analyze the performance of the normalized version when sharded into 2 databases
Implemented sharding in both JDBC and Hibernate.
Number of shards = 2
o Assumed primaryId to be unique across all the Stations.
o Location algorithm: determines the shard # using a hash on the Station's primaryId.
Multiple Postgres instances run on the same host.
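Given the set-up above (per the run instructions, the 1st instance at localhost:5432 and
the 2nd at localhost:5433), choosing the connection URL for a shard can be sketched as
follows; the database name "weather" is an assumption:

```java
public class ShardUrls {
    private static final int BASE_PORT = 5432; // shard 0 -> 5432, shard 1 -> 5433

    /** Builds the JDBC URL for the Postgres instance that hosts the given shard. */
    public static String jdbcUrlForShard(int shard) {
        return "jdbc:postgresql://localhost:" + (BASE_PORT + shard) + "/weather";
    }
    // A DriverManager.getConnection(jdbcUrlForShard(shard), user, password)
    // call would then open the connection to the selected instance.
}
```

Combined with the primaryId hash from the location algorithm, this is all that is needed
to route each query to the right database instance.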
Observations
JDBC:
The developer had to think from a database designer's perspective to design and implement
the sharding methodology.
The number of shards can be changed via a single variable in JDBCUtil.java.
Uses a foreign-key relation from Forecast to Station.
Implemented using single Station and Forecast classes, but the JDBC query is executed
using the connection to the appropriately selected instance.
Hibernate:
Found the implementation process very developer-friendly.
Implemented the ManyToOne strategy for creating relationships. Also, the Station table
is loaded first and then the Forecast data is entered into the database.
The relationship is mapped by retrieving the Station object from the database and
associating it with the Forecast details while inserting Forecast data. This avoided the
file-searching and mapping time and provided a performance gain.
Proper data types are used for the columns in the database.
The number of shards can be changed via a single variable in HibernateUtil.java.
Uses a @ManyToOne relation from Forecast to Station.
Implemented using single Station and Forecast classes, but the object is saved using the
session to the appropriately selected instance.
Test cases
Three types of search queries are written to test the query performance of the various
approaches used in this project. The following section describes these approaches and how
they affect performance.
For a given StationId: fetches related records from the Station and Forecast tables.
For a given temperature range: fetches all the related Station and Forecast details.
For a given Forecast time: fetches all the related Station and Forecast details.
Performance comparison table:

                           JDBC      Hibernate
Denormalization*
  Loading (in hrs)         10        1.3
  Test cases (in msecs)
    Query1                 194       986
    Query2                 10500     4099
    Query3                 780       1033
Normalization*
  Loading (in hrs)         3         6
  Test cases (in msecs)
    Query1                 113       2638
    Query2                 7747      16863
    Query3                 768       1231
Indexing**
  Loading (in hrs)         -         11
  Test cases (in msecs)
    Query1                 -         119000
    Query2                 -         60000
    Query3                 -         2000
Sharding*
  Loading (in hrs)         2.2       2.45
  Test cases (in msecs)
    Query1                 127       1471
    Query2                 6363      490267
    Query3                 885       2539

*Performed on a MacBook with 8 GB RAM
**Performed on a PC with 4 GB RAM
Performance comparison graphs

[Graph: Design vs. Load Performance (in hrs) — JDBC vs. Hibernate for the Denormalized,
Normalized, and Sharding designs]
Denormalized:
1. In JDBC the records are loaded into the database tuple by tuple. This involves hitting
the database each time and resulted in a huge performance loss.
2. In Hibernate data is loaded in batches. Since there were no constraints, the data
loaded faster and took less write time.
Normalized:
1. In JDBC, since the data in the Forecast table is normalized, the load performance
improved drastically.
2. In Hibernate, since we tried to load both Station and Forecast data simultaneously, it
required reading all the records in the Forecast file for each Station. This resulted in
high file-reading time and a performance loss. Also, Hibernate is slower than JDBC
because of the additional ORM layer.
Sharding:
1. The load time is faster since the data is distributed, resulting in database
operations on a comparatively smaller set of records.
[Graph: Design vs. Query1 Performance (in msecs) — JDBC vs. Hibernate for the
Denormalized, Normalized, and Sharding designs]
Query1: finds all the details (including Forecasts) related to a Station by passing the
Station Id as a parameter.
Denormalized:
1. The JDBC join is faster than the ORM HQL join, since JDBC accesses the database
directly.
Normalized:
1. The JDBC join is faster than ORM lazy fetching, since JDBC accesses the database
directly.
2. For the normalized version, query performance can be improved by eager fetching.
Sharding:
1. In both JDBC and Hibernate, performance improved, since fewer records are involved in
searching the data.
[Graph: Design vs. Query2 Performance (in msecs) — JDBC vs. Hibernate for the
Denormalized, Normalized, and Sharding designs]
Query2: fetches all the records whose temperature falls between 80 and 90.
1. This query has a high response time due to the large number of records (~2.3 million)
retrieved from the database.
2. The JDBC query performs faster in the normalized version due to the foreign-key
constraint, and performance improves further with sharding due to less searching.
3. Hibernate HQL queries are faster in the denormalized version since there is less
mapping involved, irrespective of the query type.
4. The response time increased for Hibernate sharding, since data is fetched from both
shards and an inner query needs to be performed to get the mapped ids. Also, ORM mapping
time increases due to the extra sessions and transactions involved.
[Graph: Design vs. Query3 Performance (in msecs) — JDBC vs. Hibernate for the
Denormalized, Normalized, and Sharding designs]
Query3: fetches all the records for a particular timestamp.
The reasons for the performance variations in this graph are the same as for Query2. The
decrease in time is due to the smaller number of records returned compared to Query2.
Lessons Learnt:
Manually reading from each file takes a lot of time.
Hibernate requires fewer lines of code than JDBC, but analyzing and using Hibernate is
much more complex.
Hibernate demands a steep learning curve due to its complicated background processing.
Accommodating a database change is easier with Hibernate than with JDBC.
The future work of this project includes comparing performance after introducing a
second-level cache for Hibernate and caching mechanisms for JDBC.
Avoid triggers and indexes while loading the data.
The partitions or shards should be distributed as uniformly as possible.
Hibernate provides a lot of flexibility in writing queries compared to JDBC. Criteria
queries are useful when a query needs to be generated dynamically. Also, the various
fetching strategies (eager/lazy fetching) provide flexibility in framing the selection
process.
JDBC is the older paradigm but easier to debug; Hibernate is the newer paradigm and
tougher to debug.
Thinking and working with objects via serialization is better than thinking and working
with SQL query strings.
Using Criteria to perform queries is powerful enough to retrieve related objects; this
would require explicit joins or multiple queries in JDBC.
With Hibernate, the table structures were controlled directly from the object
definitions, providing better control and flexibility to recreate the tables. With JDBC
this requires matching the SQL structure with the Java objects/code.
Sessions are an important concept in Hibernate that improve performance through batching
and caching. In JDBC, caching must be explicitly coded by the developer.
There is no 'SaveOrUpdate' concept in JDBC/SQL queries; it has to be implemented using a
stored procedure that performs an 'update' else an 'insert'. Hibernate provides this
flexibility.
JDBC can suffer update conflicts when two threads simultaneously update a record, unless
the developer embeds extra version information in each query. With Hibernate's @Version
annotation this behavior is easy to achieve.
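The version-embedding workaround for JDBC mentioned above can be sketched as a
version-checked UPDATE; the table and column names are illustrative:

```java
public class VersionedUpdate {
    /**
     * A version-checked UPDATE: it only succeeds when the row still carries
     * the version number the reader originally saw, which is the check
     * Hibernate performs implicitly for fields annotated with @Version.
     */
    public static String sql() {
        return "UPDATE forecast SET temperature = ?, version = version + 1 "
             + "WHERE forecast_id = ? AND version = ?";
    }
    // If executeUpdate(...) returns 0, another thread updated the row first,
    // and the caller must re-read the row and retry.
}
```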
Using composite keys is a bit more work in Hibernate, but it remains easy and flexible.
JDBC requires explicit ResultSet-to-object conversion; with Hibernate this happens
implicitly.
Using JPA relationships provides a lot of flexibility in embedding related objects and
expressing the relations between them.
Instructions to run the program
1. mesowest.sh is used to download files from the internet.
Open it in an editor and change the variable 'arch' to the path where you want to store
the downloaded data.
2. For all the project folders, make the following changes to the DirectoryRead.java file:
Change the variable homePath to the path where all the downloaded folders are kept.
Optionally, this path can be provided as the 1st command-line argument.
Change the variable archivePath to the path where you want to move the data after it has
been read from the files. Optionally, this path can be provided as the 2nd command-line
argument.
3. For all Hibernate projects, make the following changes to the hibernate.cfg.xml file:
Edit the configuration details.
Optionally, an alternate configuration file can be provided as the 3rd command-line
argument.
4. For all JDBC-related projects, make the following changes to the JDBCUtil.java file:
Change the database connection details.
By default, the 1st instance is at localhost:5432 and the 2nd instance is at localhost:5433.
5. For all JDBC-related projects, run the .sql files to create the database schema.
6. In AutomatedBaseVersion, we automated directly downloading the data from the website
and loading it into the database.
Contribution
Manjula Kollipara
o JDBC Normalization, JDBC Partitioning, JDBC Horizontal Sharding
o Hibernate Partitioning, Hibernate Horizontal Sharding
o Jar & Ant builds, Database design
Roopa Penmetsa
o Hibernate indexing version
o Test cases for all hibernate versions
o Ant builds
Sumalatha Elliadka
o Hibernate base version, Hibernate normalization.
o JDBC JUnit library functions
o Test scripts
Sridhar Srigiriraju:
o Created shell scripts to automate the data download process
o Database design