Sqoop


Why Sqoop:
As more organizations deploy Hadoop to analyse vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases, data warehouses and other data sources.
  • Loading bulk data into Hadoop from production systems, or accessing it from map-reduce applications running on a large cluster, is a challenging task since transferring data using scripts is inefficient and time consuming.
  • Data is already available in RDBMSs worldwide.
  • Certain data needs to be moved from the RDBMS to Hadoop for processing.
  • Transferring data using scripts is inefficient and time consuming.
Hadoop-Sqoop?
  • Hadoop is great for storing massive data in terms of volume using HDFS.
  • It provides a scalable processing environment for structured and unstructured data.
  • But it is batch-oriented and thus not suitable for low-latency operations.
  • Sqoop is basically an ETL tool used to copy data between HDFS and SQL databases.
    - Import SQL data to HDFS for archival or analysis.
    - Export HDFS data to SQL.



Apache Sqoop is a tool that transfers data between the Hadoop ecosystem and enterprise data stores. Sqoop does this by providing methods to transfer data to HDFS or Hive (using HCatalog).
  • Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase.
  • Sqoop can also be used to export data from Hadoop to external structured datastores such as relational databases and enterprise data warehouses.
  • It parallelizes data transfer for fast performance and optimal system utilisation.
  • It copies data quickly from external systems to Hadoop and makes data analysis more efficient.
  • Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres and HSQLDB.
  • Sqoop is a simple data ingestion tool: the user specifies the "what" and leaves the "how" to the underlying processing engine.
  • It mitigates excessive load on external systems.
  • Originally developed by Cloudera.
How Sqoop works:
  • In order to connect to an external database, Sqoop users must provide a set of credentials specific to that data store. These credentials are typically in the form of a user name and password.
  • Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems.
  • The Sqoop extension API provides a convenient framework for building new connectors, which can be dropped into Sqoop installations to provide connectivity to various systems.
  • Sqoop itself comes bundled with various connectors that can be used for popular databases and data warehousing systems.
  • The dataset being transferred is sliced up into different partitions.
  • A map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
  • Each record of the data is handled in a type-safe manner, since Sqoop uses the database metadata to understand the data types.
Sqoop provides many salient features like:
  • Full Load
  • Incremental Load
  • Parallel import/export
  • Import results of SQL query
  • Compression
  • Connectors for all major RDBMS Databases
  • Kerberos Security Integration
  • Load data directly into Hive/HBase
  • Support for Accumulo



                                         Basic Commands and Syntax for Sqoop:

                Connecting to database:
                • The connect string is similar to a URL and is communicated to Sqoop with the help of the --connect argument.
                • This describes the server and database to connect to, and it may also specify the port.
                • You can use the --username and --password or -P to supply a username and a password to the database.
                For example:
                $ sqoop import \
                --connect jdbc:mysql://IpAddress:port/DBName  \
                --table tableName \
                --username sqoop \ 
                --password sqoop 

                Note: We are going to use MySQL as the RDBMS. To log in to MySQL in the Cloudera VM, use the command below in a terminal.
                $ mysql -uroot -pcloudera

                This will take you to the MySQL shell, where you can play with MySQL commands, create databases/tables, and then import/export data between the RDBMS and HDFS via the Sqoop tool.
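                For example, a minimal sketch of creating a small test table to import later (the database name testdb and the emp table below are just hypothetical examples):

                mysql> CREATE DATABASE testdb;
                mysql> USE testdb;
                mysql> CREATE TABLE emp (id INT PRIMARY KEY, name VARCHAR(50), salary INT);
                mysql> INSERT INTO emp VALUES (1, 'John', 45000), (2, 'Mary', 52000);
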
                Sqoop-Import Commands:
                The Sqoop import command imports a table from an RDBMS to HDFS. Each row of the table is considered a separate record in HDFS. Records can be stored as text files, or in binary representation as Avro or SequenceFiles.

                Importing a Table into HDFS :

                $ sqoop import \
                --connect \
                --table \
                --username \ 
                --password \
                --as-textfile \
                --target-dir \
                -m 1
                
                parameter details:
                --connect        Takes a JDBC URL and connects to the database (e.g. jdbc:mysql://localhost/userdb)
                --table             Source table name to be imported
                --username    Username to connect to database
                --password     Password of the connecting user
                --target-dir     Imports data to the specified directory
                --as-<format>   File format option; textfile is the default if you don't mention an --as option:
                  --as-avrodatafile    Imports data to Avro Data Files
                  --as-sequencefile    Imports data to SequenceFiles
                  --as-textfile        Imports data as plain text (default)
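                As a concrete sketch (the testdb database, emp table and target directory below are hypothetical examples; the root/cloudera credentials are the Cloudera VM defaults used above):
                $ sqoop import \
                --connect jdbc:mysql://localhost/testdb \
                --table emp \
                --username root \
                --password cloudera \
                --as-textfile \
                --target-dir /user/cloudera/emp_import \
                -m 1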
                


                Importing Selected Data from Table:

                $ sqoop import \
                --connect \
                --table \
                --username \ 
                --password \
                --columns \
                --where \
                --target-dir \
                -m 1
                
                
                parameter details:
                --columns       Selects a subset of columns (e.g. col1,col2,col3)
                --where           Retrieves the data which satisfies the condition
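                For instance, a sketch importing only two columns of the hypothetical emp table for rows matching a condition:
                $ sqoop import \
                --connect jdbc:mysql://localhost/testdb \
                --table emp \
                --username root \
                --password cloudera \
                --columns "id,name" \
                --where "salary > 40000" \
                --target-dir /user/cloudera/emp_selected \
                -m 1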


                Importing Data from Query:

                $ sqoop import \
                --connect \
                --username \
                --password \
                --query \
                --target-dir \
                -m 1
                
                parameter details:
                --query        Executes the SQL query provided and imports the results
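                Note that a free-form query must contain the literal token $CONDITIONS in its WHERE clause (Sqoop replaces it with each mapper's split condition) and must not be combined with --table. A sketch using the hypothetical emp table:
                $ sqoop import \
                --connect jdbc:mysql://localhost/testdb \
                --username root \
                --password cloudera \
                --query 'SELECT id, name, salary FROM emp WHERE salary > 40000 AND $CONDITIONS' \
                --target-dir /user/cloudera/emp_query \
                -m 1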

                Incremental Imports:

                $ sqoop import \
                --connect \
                --table \
                --username \ 
                --password \
                --incremental \
                --check-column \
                --last-value \
                --target-dir \
                
                parameter details:
                --incremental        Tells the import command that this is an incremental import; the value is either append or lastmodified
                --check-column       Column which is used for the incremental import option
                --last-value         The last value of the check column from a previous import; the incremental import starts after this value
                

                Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows. Sqoop import supports two types of incremental imports:
                1. Append
                2. Lastmodified
                Append mode is to be used when new rows are continually being added with increasing values. The continually increasing column should be specified with --check-column. Sqoop imports rows whose value is greater than the one specified with --last-value.

                Lastmodified mode is to be used when records of the table might be updated, and each such update will set the current timestamp value to a last-modified column. Records whose check column timestamp is more recent than the timestamp specified with --last-value are imported.
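                A sketch of an append-mode incremental import of the hypothetical emp table, picking up only rows whose id is greater than 2:
                $ sqoop import \
                --connect jdbc:mysql://localhost/testdb \
                --table emp \
                --username root \
                --password cloudera \
                --incremental append \
                --check-column id \
                --last-value 2 \
                --target-dir /user/cloudera/emp_incr \
                -m 1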



                Sqoop-Import-all-Tables:
                The import-all-tables tool imports all tables in an RDBMS database to HDFS. Data from each table is stored in a separate directory in HDFS. The following conditions must be met in order to use sqoop-import-all-tables:
                1. Each table should have a single-column primary key.
                2. You should import all columns of each table.
                3. You should not use a splitting column, and should not check any conditions using a where clause.

                $ sqoop import-all-tables \
                --connect \
                --username \
                --password \
                --warehouse-dir \
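                A sketch importing every table of the hypothetical testdb database (each table lands in its own sub-directory under the warehouse directory):
                $ sqoop import-all-tables \
                --connect jdbc:mysql://localhost/testdb \
                --username root \
                --password cloudera \
                --warehouse-dir /user/cloudera/testdb_all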
                

                A few arguments helpful with Sqoop import:
                Argument                    Description
                --num-mappers, -m           Number of mappers to launch
                --fields-terminated-by      Field separator
                --lines-terminated-by       End-of-line separator
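                For example, a sketch importing the hypothetical emp table with four parallel mappers and a pipe character as the field separator:
                $ sqoop import \
                --connect jdbc:mysql://localhost/testdb \
                --table emp \
                --username root \
                --password cloudera \
                --fields-terminated-by '|' \
                --lines-terminated-by '\n' \
                --target-dir /user/cloudera/emp_piped \
                --num-mappers 4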

                Sqoop-Export Command:

                Sqoop export command exports a set of files in an HDFS directory back to RDBMS tables. The target table should already exist in the database.
                Sqoop export command prepares INSERT statements from the set of input data and then hits the database. It is meant for exporting new records: if the table has a unique constraint (such as a primary key) and the record already exists, the export job fails because the INSERT statement fails. If you have updates, you can use the --update-key option; then Sqoop prepares UPDATE statements which update the existing rows, instead of the INSERT statements as earlier.

                $ sqoop export \
                --connect \
                --table \
                --username \ 
                --password \
                --export-dir \
                --input-fields-terminated-by ','  
                
                parameter details:
                --export-dir     Directory in HDFS where the data to be exported lies
                --input-fields-terminated-by ','     Field separator used in the input files; use this when required
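                For example, a sketch exporting the previously imported files back into a hypothetical emp_copy table (which must already exist in MySQL):
                $ sqoop export \
                --connect jdbc:mysql://localhost/testdb \
                --table emp_copy \
                --username root \
                --password cloudera \
                --export-dir /user/cloudera/emp_import \
                --input-fields-terminated-by ','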
                
                

                Or use the command below if you want to update a record that already exists, based on a key column.
                $ sqoop export \
                --connect \
                --table \
                --username \ 
                --password \
                --update-key \
                --export-dir \
                --input-fields-terminated-by ','
                parameter details:
                --update-key      Column name which will be used as the key to update the record if it already exists
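                For example, a sketch that updates existing rows of the hypothetical emp_copy table by matching on the id column:
                $ sqoop export \
                --connect jdbc:mysql://localhost/testdb \
                --table emp_copy \
                --username root \
                --password cloudera \
                --update-key id \
                --export-dir /user/cloudera/emp_import \
                --input-fields-terminated-by ','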


                Sqoop-Job:
                The Sqoop job command allows us to create a job. A job remembers the parameters used to create it, so it can be invoked at any time with the same arguments.

                Creating a Job:
                $ sqoop job --create myjob \
                -- import \
                --connect jdbc:mysql://localhost/db \
                --username root \
                --table employee -m 1
                Show Job list:
                $ sqoop job --list
                Find the details of a Job:
                $ sqoop job --show yourjobname
                Execute Job:
                $ sqoop job --exec yourjobname
                Sqoop Eval:
                The Sqoop 'eval' tool allows users to execute user-defined queries against the respective database server and preview the results in the console. The 'eval' tool can be used both for previewing data and for defining and running SQL statements; that means we can use eval for insert statements too.
                $ sqoop eval --connect --username --password --query
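                For example, a quick sketch previewing a few rows of the hypothetical emp table without running a full import:
                $ sqoop eval \
                --connect jdbc:mysql://localhost/testdb \
                --username root \
                --password cloudera \
                --query "SELECT * FROM emp LIMIT 5"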


                List Databases & Tables using Sqoop:
                $ sqoop list-databases --connect --username --password
                $ sqoop list-tables --connect --username --password
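                For example, against the local MySQL instance and hypothetical testdb database used above:
                $ sqoop list-databases \
                --connect jdbc:mysql://localhost \
                --username root \
                --password cloudera

                $ sqoop list-tables \
                --connect jdbc:mysql://localhost/testdb \
                --username root \
                --password cloudera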

                Sqoop-codegen:
                From the viewpoint of an object-oriented application, every database table has one DAO class that contains 'getter' and 'setter' methods to initialize objects. This tool (codegen) generates the DAO class automatically.
                It generates the DAO class in Java, based on the table schema structure. The Java definition is instantiated as a part of the import process. The main usage of this tool is to recreate the generated Java code if it has been lost; in that case it creates a new version of the class, using the default delimiters between fields.
                $ sqoop codegen \
                --connect jdbc:mysql://localhost/userdb \
                --username root \ 
                --table emp
                Importing Data from RDBMS into Hive:
                By appending '--hive-import' to the Sqoop import command, Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition.
                • Using Hive import, Sqoop converts the data from the native datatypes of the external datastore into the corresponding types in Hive.
                • Sqoop automatically chooses the native delimiter set used by Hive.
                $ sqoop import --connect jdbc:mysql://localhost/mysqldbname \
                   --username root \
                    --password mypwd \
                   --table mytablename \
                   --hive-import  \
                   --hive-table HiveDBName.HiveTableName \
                    -m 1

                Importing Data from RDBMS into HBase:
                // Loading the data into an existing HBase table
                $ sqoop import  \
                --connect jdbc:mysql://localhost/mysqldbname  \
                --table mysqltblname  \
                --hbase-table HBasetableName  \
                --column-family HBasecolfamilyname \
                --hbase-row-key id \
                -m 1
                // Loading the data into HBase when the HBase table does not exist
                Just add '--hbase-create-table' to the above command, which will create a new table with the column family given in the command.
                $ sqoop import \
                --connect jdbc:mysql://localhost/mysqldbname \
                --table mysqltblname \
                --hbase-create-table \
                --hbase-table HBasetableName \
                --column-family HBasecolfamilyname \
                --hbase-row-key id \
                -m 1

                How to Install Sqoop:

                • To install Sqoop, download Sqoop-*.tar.gz
                • Untar the downloaded file: tar -xvf sqoop-*.tar.gz
                • export HADOOP_HOME=/some/path/hadoop-dir
                • Please add the vendor specific JDBC jar to $SQOOP_HOME/lib
                • Change to the Sqoop bin folder and run:
                   
                  ./sqoop help
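                Putting the steps together, a minimal sketch (the Sqoop version, paths and JDBC jar name below are placeholders for your environment):
                $ tar -xvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
                $ export SQOOP_HOME=/home/cloudera/sqoop-1.4.7.bin__hadoop-2.6.0
                $ export HADOOP_HOME=/some/path/hadoop-dir
                $ cp mysql-connector-java-*.jar $SQOOP_HOME/lib/
                $ cd $SQOOP_HOME/bin
                $ ./sqoop help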
