2. Hive Evolution
• Original
• Let users express their queries in a high-level language without having to
write MapReduce programs.
• Mainly targeted at ad-hoc queries.
• Used as a data tool, usually in CLI mode.
• Now more …
• A parallel SQL DBMS that happens to use Hadoop for its storage and
execution layers.
• Ad-hoc + regular
• As a service …
3. Introduction
• Limitations of HiveServer1
• Concurrency
• Security
• Client Interface
• Stability
• Sessions/Concurrency
• The old Thrift API and server implementation didn't support concurrency.
• xDBC
• The old Thrift API didn't support common xDBC (JDBC/ODBC) interfaces.
• Authentication/Authorization
• Incomplete implementations
• Auditing/Logging
HiveServer2:
• Available from Hive 0.11 / CDH 4.1
• Redesigned and re-implemented (HIVE-2935).
• HiveServer2 is a container for the Hive
execution engine (Driver).
• For each client connection, it creates a
new execution context (Connection and
Session) that serves Hive SQL requests
from the client.
• The new RPC interface enables the server
to associate this Hive execution context
with the thread serving the client’s
request.
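The idea of binding an execution context to the thread that serves a client can be sketched as follows. This is only an illustration in Python, not HiveServer2 code: `ExecutionContext`, `serve_connection`, and `run_demo` are hypothetical names, and a thread-local variable stands in for whatever mechanism the server actually uses.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class ExecutionContext:
    """Hypothetical stand-in for the per-connection context (Connection and Session)."""

    def __init__(self, client_name):
        self.client_name = client_name
        self.session_id = uuid.uuid4().hex  # unique per connection
        self.conf = {}                      # per-session settings


# Thread-local storage: each serving thread sees only its own context.
_current = threading.local()


def current_context():
    return _current.context


def serve_connection(client_name, statements):
    """Runs in the worker thread that serves one client connection."""
    _current.context = ExecutionContext(client_name)
    results = []
    for stmt in statements:
        ctx = current_context()  # resolved from the serving thread
        results.append((ctx.client_name, ctx.session_id, stmt))
    return results


def run_demo():
    # Two clients, each served by a worker thread from the pool.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(serve_connection, "client-1", ["SELECT 1", "SELECT 2"])
        f2 = pool.submit(serve_connection, "client-2", ["SHOW TABLES"])
        return f1.result(), f2.result()
```

Every statement of one client resolves to the same context, while the two clients never see each other's session.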
5. Architecture: Internal
• Core Contexts
• Connections
• Sessions
• Operations
• Operation Path …
[Figure: HiveServer2 internal architecture. Components, top to bottom:]
• hiveServer2 (main entry): starts thriftCLIService.
• Client-1, Client-2, … connect through the Thrift RPC Iface.
• thriftCLIService (TThreadPoolServer, implements the client RPC Iface): listen() and accept() new client connections, and process each in its own thread (threads for client connections).
• thriftCLIService calls cliService through the ICLIService internal interface.
• cliService (real implementations of the various operations): open/close sessions, run operations in existing sessions …
• sessionManager (handleToSessionMap): holds the sessions behind the HiveSession interface, each with its own HiveConf and SessionState; runAsync uses the backgroundOperationPool (threads for async operations).
• operationManager (handleToOperationMap): create and run operations, sync/async.
• SQLop: create and run the Hive Driver.
• Operation types: SQLOp/SetOp/DfsOp/AddResourceOp/DeleteResourceOp ..
• GetTypeInfoOp/GetCatalogsOp/GetSchemasOp/GetTablesOp/GetTableTypesOp/GetColumnsOp/GetFunctionsOp ...
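The two handle maps and the sync/async split can be sketched roughly as below. This is a simplified Python illustration, not the Hive implementation: the class and method names mirror the labels in the diagram but are hypothetical, the session is reduced to a dict, and `Operation.run` only stands in for "create and run Hive Driver".

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class Operation:
    def __init__(self, statement):
        self.handle = uuid.uuid4().hex
        self.statement = statement
        self.state = "INITIALIZED"
        self.result = None

    def run(self):
        self.state = "RUNNING"
        # Placeholder for creating and running the Hive Driver.
        self.result = f"ran: {self.statement}"
        self.state = "FINISHED"


class OperationManager:
    def __init__(self):
        self.handle_to_operation = {}  # handleToOperationMap

    def new_operation(self, statement):
        op = Operation(statement)
        self.handle_to_operation[op.handle] = op
        return op


class SessionManager:
    def __init__(self, pool_size=2):
        self.handle_to_session = {}  # handleToSessionMap
        self.operation_manager = OperationManager()
        # Threads for async operations.
        self.background_pool = ThreadPoolExecutor(max_workers=pool_size)

    def open_session(self):
        handle = uuid.uuid4().hex
        # Each session gets its own configuration and operation set.
        self.handle_to_session[handle] = {"conf": {}, "op_handles": set()}
        return handle

    def execute(self, session_handle, statement, run_async=False):
        session = self.handle_to_session[session_handle]
        op = self.operation_manager.new_operation(statement)
        session["op_handles"].add(op.handle)
        if run_async:
            op.state = "PENDING"
            future = self.background_pool.submit(op.run)
            return op, future  # returns immediately
        op.run()  # sync: runs in the calling (connection) thread
        return op, None
```

A caller of the async path would keep the returned operation handle and poll its state later, which is what the status and fetch calls of the client API are for.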
6. Architecture: Server Context
• Client
• Connection (Thread)
• Session (-> HiveConf, SessionState)
• Operation (-> Driver)
• Usually, a client opens only one Session in a Connection (refer to the JDBC HiveDriver: HiveConnection).
[Figure: server context hierarchy. Labels: Client-1, Connection-1 (Thread); Client-2, Connection-2 (Thread); Session-11, Session-12; Op-121 (SQL, Driver), Op-122, Op-123 (SQL, Driver).]
7. New Client API
• TCLIService.thrift
• Complete API
• Complete Database API
• Think about JDBC/ODBC
• To be compatible with existing DB software
• Hive Specific API
• Best Practice
• Client API vs. Internal API
• Converting and Isolation
Session
• OpenSession: The client requests a new session. The server creates a new HiveSession and returns a unique SessionHandle (UUID). All other calls depend on this session.
• CloseSession: The client requests to close the session. This also closes and removes all operations in the session.
SQL and Hive Command Operation
• ExecuteStatement: Executes an HQL statement (SQLOp). Some SQL statements can be tagged "runAsync"; they are then executed in a dedicated thread and the call returns immediately. Hive commands: SetOp, DfsOp, AddResourceOp, DeleteResourceOp.
DB Metadata Operation
• GetInfo *: Gets various global variables of Hive (Key-Type->Value).
• GetTypeInfo: Gets the detailed description and constraints of the data types.
• GetCatalogs: Does nothing so far.
• GetSchemas: Gets schemas from the metastore.
• GetTables: Gets table schemas from the metastore.
• GetTableTypes: Gets the table types, e.g. MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW, INDEX_TABLE.
• GetColumns: Gets the columns of a table from the metastore.
• GetFunctions: Gets the UDF functions.
Operation for Operation
• GetOperationStatus: Gets the state of an operation by its operation handle: INITIALIZED/RUNNING/FINISHED/CANCELED/CLOSED/ERROR/UNKNOWN/PENDING.
• CancelOperation: Cancels a RUNNING or PENDING operation by its operation handle. For SQLOp, does cleanup: closes and destroys the Hive Driver, deletes temp output files, and cancels the task running in the background thread…
• CloseOperation: Removes the operation and closes it: for SQLOp, does cleanup; for HiveCommandOp, tearDownSessionIO.
Get Result
• GetResultSetMetadata: Gets the result set's schema, such as the column titles.
• FetchResults: Fetches the result rows from the actual result set.
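The operation states and the cancel/close rules listed above can be modelled with a small state machine. The sketch below is illustrative Python only: `ToyOperation` and `ToySession` are hypothetical names, and the choice to raise an error when cancelling a non-cancellable operation is an assumption, not documented HiveServer2 behavior.

```python
from enum import Enum


class OperationState(Enum):
    INITIALIZED = "INITIALIZED"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    CANCELED = "CANCELED"
    CLOSED = "CLOSED"
    ERROR = "ERROR"
    UNKNOWN = "UNKNOWN"
    PENDING = "PENDING"


# Only RUNNING or PENDING operations can be cancelled.
CANCELLABLE = {OperationState.RUNNING, OperationState.PENDING}


class ToyOperation:
    def __init__(self):
        self.state = OperationState.INITIALIZED

    def cancel(self):
        if self.state not in CANCELLABLE:
            # Assumption: reject the call instead of ignoring it.
            raise ValueError(f"cannot cancel operation in state {self.state.name}")
        self.state = OperationState.CANCELED

    def close(self):
        self.state = OperationState.CLOSED


class ToySession:
    def __init__(self):
        self.operations = {}  # operation handle -> operation

    def add(self, handle, op):
        self.operations[handle] = op

    def close(self):
        # CloseSession also closes and removes all operations in the session.
        for op in self.operations.values():
            op.close()
        self.operations.clear()
```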
8. Code
• Packages
• org.apache.hive.service …, top project of apache…
• Pros
• Clear Implementation
• Decoupling of HiveServer2 and HiveCore
• Decoupling of Thrift Client API and Internal Code
• Cons
• Too many design patterns.
• Inconsistent principles in some places.
• HiveServer2 and HiveCore are still not completely decoupled.
• The JDBC driver package/jar still relies on a lot of other core code, such as Hive->Hadoop and the libs… (maybe because of the support for Embedded Mode).
9. Code
[Figure: class diagram. Classes and their members:]
• Service: +init(), +start(), +stop(), +register(): StateChangeListener
• AbstractService: +state, +HiveConf: global, set by init()
• CompositeService: +serviceList, +addService(), +removeService()
• HiveServer2: +main(): entry point
• ThriftCLIService, ThriftBinaryService (TCLIService.Iface, TThreadPoolServer): +cliService, +OpenSession(), +CloseSession(), +GetInfo(), +ExecuteStatement(), +...(), +FetchResults()
• ICLIService, CLIService: +sessionManager, +openSession(), +closeSession(), +getInfo(), +executeStatement(), +...(), +fetchResults()
• SessionManager: +handleToSession: HashMap, +operationManager, +backgroundOperationPool (FixedThreadPool), +openSession(), +closeSession(), +getSession(), +...(), +submitBackgroundOperation()
• OperationManager: +handleToOperation: HashMap, +newExecuteStatementOperation(), +newGetTypeInfoOperation(), +...(), +addOperation(), +removeOperation(), +getOperation(), +getOperationState(), +cancelOperation(), +closeOperation(), +getOperationNextRowSet(), +...()
• HiveSession, HiveSessionImpl: +sessionHandle, +hiveConf: new for each, +sessionState: new for each, +opHandleSet, +getSessionHandle(), +getInfo(), +executeStatement(), +executeStatementAsync(), +...(), +fetchResults()
• Operation: +opHandle, +parentSession, +state, +getState(), +setState(), +run(), +getNextRowSet(), +close(), +cancel(), +...()
• Operation classes: GetInfoOperation, ExecuteStatementOperation, SQLOperation, AddResourceOperation, DeleteResourceOperation, DfsOperation, SetOperation, GetSchemasOperation, XXXOperation
• This is just a quick view; it may not be exact in every detail, and it intentionally omits some less important parts.
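The Service / AbstractService / CompositeService layering can be sketched as a small lifecycle tree. The Python below is only an illustration with hypothetical names and state strings; in particular, stopping children in reverse order is an assumption of this sketch.

```python
class AbstractService:
    def __init__(self, name):
        self.name = name
        self.state = "NOTINITED"
        self.conf = None

    def init(self, conf):
        self.conf = conf  # global configuration, set by init()
        self.state = "INITED"

    def start(self):
        self.state = "STARTED"

    def stop(self):
        self.state = "STOPPED"


class CompositeService(AbstractService):
    def __init__(self, name):
        super().__init__(name)
        self.service_list = []

    def add_service(self, service):
        self.service_list.append(service)

    def init(self, conf):
        for service in self.service_list:
            service.init(conf)
        super().init(conf)

    def start(self):
        for service in self.service_list:
            service.start()
        super().start()

    def stop(self):
        # Assumption: children stop in reverse order of registration.
        for service in reversed(self.service_list):
            service.stop()
        super().stop()


def build_server():
    """Assemble a toy server with two child services."""
    server = CompositeService("HiveServer2")
    server.add_service(AbstractService("CLIService"))
    server.add_service(AbstractService("ThriftCLIService"))
    return server
```

Calling `init`, `start`, and `stop` on the top-level object is then enough to drive every child through the same lifecycle.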
10. HiveCore and Dependent Hive Env.?
• HiveConf
• Global instance
• Instance for each Session.
• Clients can inject additional key-value configurations at OpenSession.
• Set an explicit session name (id) to control the download directory name.
• Hive SessionState
• Instance for each Session.
• Hive Driver
• Instance for each SQL Operation.
• Global static variables?
• ??
• SetOperation -> SetProcessor
• set env: environment variables cannot be set.
• set system: global; calls System.getProperties().setProperty(..)
• Should we forbid system settings, or allow only administrators to change them?
• set hiveconf: per instance.
• set hivevar: per instance.
• set: per instance.
• AddResource and DeleteResourceOperation
• SessionState.add_resource/delete_resource
• DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", System.getProperty("java.io.tmpdir") + File.separator + "${hive.session.id}_resources")
• DfsOperation
• Authentication with HDFS?
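The scoping of the different `set` prefixes can be illustrated with a toy dispatcher. This is a Python sketch, not the SetProcessor code: `process_set` and the dict-based session are hypothetical, and storing an unprefixed `set key=value` in the session's hiveconf is an assumption of this sketch.

```python
def process_set(session, key, value, system_props):
    """Apply one 'set key=value' command to the right scope."""
    if key.startswith("env:"):
        # Environment variables are read-only.
        raise ValueError("env:* variables cannot be set")
    if key.startswith("system:"):
        # Global: visible to every session in the server process.
        system_props[key[len("system:"):]] = value
    elif key.startswith("hiveconf:"):
        session["hiveconf"][key[len("hiveconf:"):]] = value
    elif key.startswith("hivevar:"):
        session["hivevar"][key[len("hivevar:"):]] = value
    else:
        # Assumption: a plain key goes to the per-session hiveconf.
        session["hiveconf"][key] = value


def new_session():
    return {"hiveconf": {}, "hivevar": {}}
```

With two sessions sharing one `system_props` dict, a `system:` setting made by one session is seen by the other, which is why the slide asks whether it should be restricted.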
11. Handle (Identifier)
• SessionHandle
• OperationHandle
• Use UUID
• Now, only the public ID is used; that is OK.
Thrift IDL:
struct THandleIdentifier {
  // 16 byte globally unique identifier
  // This is the public ID of the handle and
  // can be used for reporting.
  1: required binary guid,
  // 16 byte secret generated by the server
  // and used to verify that the handle is not
  // being hijacked by another user.
  2: required binary secret,
}
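What checking the secret half of a handle could look like is sketched below. This is a hypothetical Python illustration, not how HiveServer2 behaves: `HandleIdentifier` and `HandleRegistry` are invented names, and the server-side lookup table is an assumption.

```python
import hmac
import os
import uuid


class HandleIdentifier:
    def __init__(self):
        self.guid = uuid.uuid4().bytes  # 16-byte public ID, safe to log
        self.secret = os.urandom(16)    # 16-byte secret generated by the server


class HandleRegistry:
    """Hypothetical server-side table of issued handles."""

    def __init__(self):
        self._secrets = {}  # guid -> secret

    def issue(self):
        handle = HandleIdentifier()
        self._secrets[handle.guid] = handle.secret
        return handle

    def verify(self, guid, secret):
        expected = self._secrets.get(guid)
        # Constant-time comparison, so timing does not leak the secret.
        return expected is not None and hmac.compare_digest(expected, secret)
```

A client that only knows the public guid, for example from a log line, would fail `verify` because it cannot present the matching secret.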
17. Think More …
• Thinking of XX as Platform
• Standard JDBC/ODBC
• RESTful API over HTTP, Web Service
• AWS Redshift, SimpleDB …
• Hive as a Service?
• http://www.qubole.com/
• Request a cluster, run SQL ad hoc and on a regular schedule, workflow and scheduling.
• Language
• SQL, R, Pig
• Computing of Estimation, Probability …