3. Data Ingestion Options
Batch Load from RDBMS:
Sqoop: an RDBMS can support multiple parallel connections, so millions of rows can be imported in a reasonable timeframe, and throughput can be scaled by raising the number of parallel mappers. Most vendors these days also offer a dedicated loader/connector product that delivers better performance and stronger security than a generic Sqoop JDBC import; for example, Oracle provides OraOop and the Oracle Big Data Connectors.
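A minimal Sqoop invocation sketch showing how parallelism is requested; the JDBC URL, credentials, table, split column, and target directory below are all placeholder values, not from the original notes:

```shell
# Import one table with 8 parallel mappers; --split-by tells Sqoop
# how to partition the table across those parallel connections.
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/ORCL \
  --username etl_user -P \
  --table SALES \
  --split-by SALE_ID \
  --num-mappers 8 \
  --target-dir /data/staging/sales
```

Raising `--num-mappers` increases parallelism, but only up to the number of concurrent connections the source database can comfortably serve.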
Data from files:
FTP the data to the edge nodes and then load it using an ETL tool; tools like Informatica or Talend can be integrated here. At roughly 40-50 MB/s per machine across 5 machines, about 1 TB can be imported in around an hour. Compressing the data improves the transfer time further, and files can be consolidated at the source so they fit Hadoop's optimal (block-aligned) file sizes.
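A quick back-of-the-envelope check of that figure, assuming a 45 MB/s sustained rate per machine (the midpoint of the quoted range):

```python
# Estimate wall-clock transfer time for a bulk file load
# spread evenly across several parallel machines.
def transfer_hours(total_bytes, machines, bytes_per_sec_each):
    aggregate_rate = machines * bytes_per_sec_each  # combined throughput
    return total_bytes / aggregate_rate / 3600      # seconds -> hours

# 1 TB over 5 machines at 45 MB/s each:
hours = transfer_hours(1e12, 5, 45e6)
print(f"{hours:.1f} hours")  # ~1.2 hours
```

So 5 machines at 40-50 MB/s each lands close to the one-hour claim; at 40-50 megabits per second the same load would take roughly ten hours, which is why the per-node disk/network rate matters so much here.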
Real-time data ingestion:
Flume is good at transport and some light in-flight enrichment.
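For illustration, a minimal Flume agent that tails a spooling directory into HDFS; the agent/source/sink names, directory, and HDFS path are placeholders, not values from the notes:

```properties
# agent1: spooldir source -> memory channel -> HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/incoming
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/data/logs/%Y-%m-%d
agent1.sinks.sink1.channel   = ch1
```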
Storm + queue (Kafka): good for low-latency continuous ingestion. With Storm we can do substantial processing on the data while ingesting it, including event processing such as fraud detection and pattern matching as the data flows. The Flume vs. Storm decision should therefore depend largely on the amount of processing needed in-flight.
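To make the in-flight fraud-detection idea concrete, here is a toy sketch of the kind of stateful pattern match a Storm bolt would run as events arrive; the event fields, window size, and threshold are all invented for illustration, and this is plain Python rather than the Storm API:

```python
from collections import defaultdict, deque

WINDOW = 5       # events of history kept per card (assumed value)
CITY_LIMIT = 3   # distinct cities in the window that triggers an alert

def detect_fraud(events):
    """Flag a card used in CITY_LIMIT+ distinct cities within its recent window."""
    history = defaultdict(lambda: deque(maxlen=WINDOW))
    alerts = []
    for event in events:  # events arrive one at a time, as in a stream
        card, city = event["card"], event["city"]
        history[card].append(city)
        if len(set(history[card])) >= CITY_LIMIT:
            alerts.append(card)
    return alerts

stream = [
    {"card": "A", "city": "NYC"},
    {"card": "A", "city": "LA"},
    {"card": "B", "city": "NYC"},
    {"card": "A", "city": "Tokyo"},  # third distinct city for card A
]
print(detect_fraud(stream))  # ['A']
```

The point is that the decision is made while the event is in flight, before it lands in HDFS; in a real topology this logic would live in a bolt consuming from a Kafka spout, with the state partitioned by card across workers.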