Sqoop
Why Sqoop: As more organizations deploy Hadoop to analyse vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases, data warehouses and other data sources.
- Loading bulk data into Hadoop from production systems, or accessing it from map-reduce applications running on a large cluster, is a challenging task, since transferring data using scripts is inefficient and time-consuming.
- Data already available in RDBMS worldwide
- Need to move certain data from RDBMS to Hadoop for processing
- Transferring data using scripts is inefficient and time-consuming
- Hadoop is great for storing massive data in terms of volume using HDFS.
- It provides a scalable processing environment for structured and unstructured data.
- But it is batch-oriented and thus not suitable for low-latency operations.
- Sqoop is basically an ETL tool used to copy data between HDFS and SQL databases.
- Import SQL data to HDFS for archival or analysis.
- Export HDFS data to SQL
Apache Sqoop is a tool that transfers data between the Hadoop ecosystem and enterprise data stores. Sqoop does this by providing methods to transfer data to HDFS or Hive (using HCatalog).
- Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase.
- Sqoop can also be used to export data from Hadoop to external structured datastores such as relational databases and enterprise data warehouses.
- Parallelize data transfer for fast performance and optimal system utilisation.
- Copies data quickly from external system to Hadoop and makes data analysis more efficient.
- Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres and HSQLDB.
- Sqoop is a simple data ingestion tool: the user specifies the "what" and leaves the "how" to the underlying processing engine.
- It mitigates excessive loads to external systems.
- Developed by Cloudera.
- In order to connect to an external database, Sqoop users must provide a set of credentials specific to that data store. These credentials are typically in the form of user name and password.
- Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems.
- The Sqoop extension API provides a convenient framework for building new connectors, which can be dropped into Sqoop installations to provide connectivity to various systems.
- Sqoop itself comes bundled with various connectors that can be used for popular databases and data warehousing systems.
- The dataset being transferred is sliced up into different partitions.
- A map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
- Each record of the data is maintained in a type-safe manner, since Sqoop uses the database metadata to understand the data types.
Key features of Sqoop:
- Full load
- Incremental load
- Parallel import/export
- Import results of a SQL query
- Compression
- Connectors for all major RDBMS databases
- Kerberos security integration
- Load data directly into Hive/HBase
- Support for Accumulo
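As a sketch of this parallelism (the connection details, table, and paths below are hypothetical), the number of map tasks, i.e. slices, can be set with -m and the splitting column with --split-by:

```shell
# Import the table using 4 parallel mappers, splitting the rows on the id column
sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoop --password sqoop \
  --table employee \
  --split-by id \
  -m 4 \
  --target-dir /user/cloudera/employee_parallel
```

Each mapper issues a range query over the split column and writes its own part file under the target directory.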
Basic Commands and Syntax for Sqoop:
Connecting to database:
- The connect string is similar to a URL and is communicated to Sqoop with the help of the --connect argument.
- This describes the server and database to connect to, and it may also specify the port.
- You can use the --username and --password or -P to supply a username and a password to the database.
For example:
$ sqoop import \
--connect jdbc:mysql://IpAddress:port/DBName \
--table tableName \
--username sqoop \
--password sqoop
Note: We are going to use MySQL as the RDBMS. To log in to MySQL in the Cloudera VM, use the below command in a terminal.
$ mysql -uroot -pcloudera
This will take you to the MySQL shell, where you can play with MySQL commands, create databases/tables, and import/export data between the RDBMS and HDFS via the Sqoop tool.
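As a minimal sketch (the database and table names here are made up for illustration), you could create a small test table to drive the Sqoop examples that follow:

```shell
# Create a sample database and table in the Cloudera VM's MySQL
mysql -uroot -pcloudera -e "
CREATE DATABASE IF NOT EXISTS testdb;
USE testdb;
CREATE TABLE IF NOT EXISTS employee (
  id INT PRIMARY KEY,
  name VARCHAR(50),
  salary INT
);
INSERT INTO employee VALUES (1, 'alice', 50000), (2, 'bob', 60000);
"
```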
Sqoop-Import Commands:
Sqoop import command imports a table from an RDBMS to HDFS. Each record from a table is considered as a separate record in HDFS. Records can be stored as text files, or in binary representation as Avro or SequenceFiles.
Importing a Table into HDFS :
$ sqoop import \
--connect \
--table \
--username \
--password \
--as-textfile \
--target-dir \
-m 1
parameter details:
--connect Takes a JDBC URL and connects to the database (e.g. jdbc:mysql://localhost/userdb)
--table Source table name to be imported
--username Username to connect to database
--password Password of the connecting user
--target-dir Imports data to the specified directory
The --as option can be any of the following, based on the file format you need; textfile is the default if you don't mention an --as option:
--as-avrodatafile Imports data to Avro data files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
Importing Selected Data from Table:
$ sqoop import \
--connect \
--table \
--username \
--password \
--columns \
--where \
--target-dir \
-m 1
parameter details:
--columns Selects a subset of columns (e.g. col1,col2,col3)
--where Retrieves only the data which satisfies the condition
Importing Data from Query:
$ sqoop import \
--connect \
--username \
--password \
--query \
--split-by \
--target-dir
parameter details:
--query Executes the SQL query provided and imports the results. The query must contain the literal token $CONDITIONS in its WHERE clause, --table must not be used together with --query, and --split-by is required when running with more than one mapper.
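For example (the database, table, and paths are hypothetical), note the $CONDITIONS token, which Sqoop replaces with a range predicate for each mapper:

```shell
# Free-form query import; single quotes keep the shell from expanding $CONDITIONS
sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoop --password sqoop \
  --query 'SELECT id, name FROM employee WHERE salary > 50000 AND $CONDITIONS' \
  --split-by id \
  --target-dir /user/cloudera/emp_query
```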
Incremental Imports:
$ sqoop import \
--connect \
--table \
--username \
--password \
--incremental \
--check-column \
--last-value \
--target-dir
parameter details:
--incremental Tells the import command that this is an incremental import; the mode must be either 'append' or 'lastmodified'
--check-column Column which is used for the incremental import option.
--last-value The last value of the above column; the incremental import starts from records after this value.
- Append
- Lastmodified.
Lastmodified mode is to be used when records of the table might be updated, and each such update will set the current timestamp value to a last-modified column. Records whose check column timestamp is more recent than the timestamp specified with --last-value are imported.
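As a sketch (database, table, and column names are hypothetical), an append-mode incremental import that picks up only rows with id greater than 100 might look like:

```shell
# Append-mode incremental import: fetch only rows where id > 100
sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoop --password sqoop \
  --table employee \
  --incremental append \
  --check-column id \
  --last-value 100 \
  --target-dir /user/cloudera/employee_incr
```

At the end of the run, Sqoop prints the new --last-value to use for the next incremental run (or stores it automatically if the import is saved as a Sqoop job).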
Sqoop-Import-all-Tables:
The import-all-tables tool imports all tables in an RDBMS database to HDFS. Data from each table is stored in a separate directory in HDFS. The following conditions must be met in order to use sqoop-import-all-tables:
1. Each table should have a single-column primary key.
2. You should import all columns of each table.
3. You should not use a splitting column, and should not apply any conditions with a WHERE clause.
$ sqoop import-all-tables \
--connect \
--username \
--password \
--warehouse-dir
A few helpful arguments with Sqoop import:
| Argument | Description |
|---|---|
| --num-mappers, -m | Number of mappers to launch |
| --fields-terminated-by | Field separator |
| --lines-terminated-by | End-of-line separator |
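For example, a sketch of an import writing tab-separated fields with two mappers (connection details and paths are hypothetical):

```shell
# Write tab-separated, newline-terminated records using 2 mappers
sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoop --password sqoop \
  --table employee \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n' \
  --num-mappers 2 \
  --target-dir /user/cloudera/employee_tsv
```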
Sqoop-Export Command:
Sqoop export command exports a set of files in a HDFS directory back to RDBMS tables. The target table should already exist in the database.
Sqoop export command prepares INSERT statements from the input data and then hits the database. It is for exporting new records: if the table has a unique-value constraint, such as a primary key, and a duplicate row is exported, the export job fails because the INSERT statement fails. If you have updates, you can use the --update-key option; Sqoop then prepares UPDATE statements, which update existing rows, instead of the INSERT statements as earlier.
$ sqoop export \
--connect \
--table \
--username \
--password \
--export-dir \
--input-fields-terminated-by ','
parameter details:
--export-dir Directory where the data which needs to be exported resides
--input-fields-terminated-by ',' --> use this when required (it should match the field separator used in the HDFS files)
Or use the below if you want to update records which already exist, based on a key column:
$ sqoop export \
--connect \
--table \
--username \
--password \
--update-key \
--export-dir \
--input-fields-terminated-by ','
parameter details:
--update-key Column name which will be used as the key to update a record if it already exists.
Sqoop-Job :
Sqoop job command allows us to create a job. A job remembers the parameters used to create it, so it can be invoked at any time with the same arguments.
Creating a Job:
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Show Job list:
$ sqoop job --list
Find the details of a Job:
$ sqoop job --show yourjobname
Execute Job:
$ sqoop job --exec yourjobname
Sqoop Eval:
The Sqoop 'eval' tool allows users to execute user-defined queries against the respective database server and preview the result in the console. The 'eval' tool can be used both for previewing data and for defining SQL statements; that means we can use eval for INSERT statements too.
$ sqoop eval --connect --username --password --query
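For example (the connection details and table name are hypothetical), previewing rows and running an INSERT:

```shell
# Preview a SELECT result directly in the console
sqoop eval \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoop --password sqoop \
  --query "SELECT * FROM employee LIMIT 5"

# eval can run DML too, e.g. an INSERT
sqoop eval \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoop --password sqoop \
  --query "INSERT INTO employee VALUES (3, 'carol', 70000)"
```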
List Databases & Tables using Sqoop:
$ sqoop list-databases --connect --username --password
$ sqoop list-tables --connect --username --password
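Concretely (assuming the Cloudera VM's default MySQL credentials and a hypothetical testdb database):

```shell
# Show all databases visible to this user
sqoop list-databases \
  --connect jdbc:mysql://localhost \
  --username root --password cloudera

# Show all tables in one database
sqoop list-tables \
  --connect jdbc:mysql://localhost/testdb \
  --username root --password cloudera
```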
Sqoop-codegen:
From the viewpoint of an object-oriented application, every database table has one DAO class that contains 'getter' and 'setter' methods to initialize objects. This tool (codegen) generates the DAO class automatically.
It generates the DAO class in Java, based on the table schema structure. The Java definition is instantiated as a part of the import process. The main usage of this tool is to regenerate the Java code if the generated source is lost; in that case it creates a new version of the class, using the default delimiters between fields.
$ sqoop codegen \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp
Importing Data from RDBMS into Hive:
By appending '--hive-import' to the Sqoop import command, Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition.
- Using Hive import, Sqoop converts the data from the native data types of the external datastore into the corresponding Hive types.
- Sqoop automatically chooses the native delimiter set used by Hive.
$ sqoop import --connect jdbc:mysql://localhost/mysqldbname \
--username root \
--password mypwd \
--table mytablename \
--hive-import \
--hive-table HiveDBName.HiveTableName \
-m 1
Importing Data from RDBMS into HBase:
//Loading the data to existing HBase table
$ sqoop import \
--connect jdbc:mysql://localhost/mysqldbname \
--table mysqltblname \
--hbase-table HBasetableName \
--column-family HBasecolfamilyname \
--hbase-row-key id \
-m 1
//Loading the data to HBase where in HBase table does not exist
Just add '--hbase-create-table' to the above command, which ensures a new table is created with the column family given in the command:
$ sqoop import \
--connect jdbc:mysql://localhost/mysqldbname \
--table mysqltblname \
--hbase-create-table \
--hbase-table HBasetableName \
--column-family HBasecolfamilyname \
--hbase-row-key id \
-m 1
How to Install Sqoop:
- To install Sqoop, download Sqoop-*.tar.gz
- Untar the downloaded file: tar -xvf sqoop-*.*.tar.gz
- export HADOOP_HOME=/some/path/hadoop-dir
- Add the vendor-specific JDBC jar to $SQOOP_HOME/lib
- Change to the Sqoop bin folder and run:
./sqoop help