Big Data: SQL on Hadoop - Introduction to Big SQL for SF Bay Area MeetUp, March 13, 2014

56 %
44 %
Information about Big Data: SQL on Hadoop - Introduction to Big SQL for SF Bay Area...

Published on March 12, 2014

Author: CynthiaSaracco



Self-study lab on Big SQL, a query interface for big data managed by InfoSphere BigInsights, IBM's Hadoop-based platform. Designed to be used with the free BigInsights Quick Start Edition VMware image.

Querying Hadoop data with Big SQL An introductory lab March 13, 2014 IBM Innovation Center, Foster City, CA Cynthia M. Saracco and Nicolas Morales, IBM Silicon Valley Lab For more technical resources, visit the BigInsights wiki (

IBM Software Page 2 Introduction to Big SQL Contents LAB 1 OVERVIEW......................................................................................................................................................... 3 LAB 2 SETTING UP YOUR ENVIRONMENT................................................................................................................ 5 2.1 LAUNCH BIGINSIGHTS .............................................................................................................................. 5 2.2 LAUNCH ECLIPSE..................................................................................................................................... 6 2.3 CREATE A BIGINSIGHTS SERVER CONNECTION IN ECLIPSE .......................................................................... 9 2.4 CREATE A BIG SQL CONNECTION IN ECLIPSE........................................................................................... 10 LAB 3 QUERYING DATA WITH BIG SQL .................................................................................................................. 13 3.1 CREATE A PROJECT AND AN SQL SCRIPT FILE........................................................................................... 14 3.2 CREATE SAMPLE TABLES AND LOAD SAMPLE DATA ..................................................................................... 16 3.3 RUNNING BASIC BIG SQL QUERIES.......................................................................................................... 19 3.4 ANALYZING THE DATA WITH BIG SQL....................................................................................................... 22 3.5 OPTIONAL: USING BIG SQL FROM A JDBC CLIENT APPLICATION ................................................................ 26 LAB 4 OPTIONAL: WORKING WITH SERDES......................................................................................................... 33 4.1 REGISTER A SERDE WITH BIG SQL AND HIVE........................................................................................... 33 4.2 CREATING, POPULATING, AND QUERYING A TABLE THAT USES A SERDE ....................................................... 34 LAB 5 SUMMARY ....................................................................................................................................................... 38

IBM Software Hands On Lab Page 3 Lab 1 Overview In this hands-on lab, you'll learn how to use SQL to query data stored in a Hadoop-based environment. To do so, you’ll work with Big SQL, a software layer provided by IBM that enables IT professionals to create tables and query data in InfoSphere BigInsights (its Hadoop platform) using familiar SQL statements. Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools — and perhaps reuse portions of existing applications — can be quite appealing to organizations new to Hadoop. Indeed, some companies with large data warehouses built on relational DBMS systems are looking to Hadoop-based platforms as a potential target for offloading "cold" or infrequently used data in a manner that still allows for query access. In other cases, organizations turn to Hadoop to analyze and filter non-traditional data (such as logs, sensor data, social media posts, etc.), ultimately feeding subsets or aggregations of this information to their relational warehouses to extend their view of products, customers, or services. Portions of this lab were based on the article What's the Big Deal about Big SQL? You should be familiar with the basic concepts of Big SQL covered in that article before beginning this lab. After completing this hands-on lab, you’ll be able to: • Create a Big SQL server connection • Create tables • Load data into tables • Query big data • Use Eclipse tooling to develop and test SQL scripts • Optionally, develop and run a JDBC client application • Optionally, register and use SerDes Allow 1 to 2 hours to complete this lab. This lab was developed for the InfoSphere BigInsights 2.1 single node Quick Start Edition VMware image, which is pre-configured with a specific host name and administrative user ID. Throughout this lab, you will be using the following information: OS Hostname Username Password Linux bivm biadmin biadmin

IBM Software Page 4 Introduction to Big SQL Detailed information about Big SQL and BigInsights is available online through the product documentation. General questions about Big SQL or BigInsights may be posted to the BigInsights external forum. If you aren't using the Quick Start Edition . . . . BigInsights Enterprise Edition V2.1 can also be used for this lab. Simply substitute you host name, user ID, and password throughout the exercises. Note that some steps require administrative authority, so you will need access to an appropriate user account. In addition, you will need access to an appropriate Eclipse environment to complete many exercises. (The Quick Start Edition VMware is pre-configured with an Eclipse shell that contains the BigInsights plug-in.)

IBM Software Hands On Lab Page 5 Lab 2 Setting up your environment In this section, you will verify that the necessary tools and services are up and running so you can begin working with BigInsights and Big SQL. After completing this hands-on lab, you’ll be able to: • Launch BigInsights and verify that all required services are running. • Launch Eclipse and verify that the BigInsights plug-in has been installed. • Create a BigInsights server connection. • Create a Big SQL server connection. Allow 30 minutes to complete this lab. 2.1 Launch BigInsights To work with Big SQL, you must verify that the appropriate BigInsights services are up and running. You can skip this section if you already know that all BigInsights services (except for Monitoring) are active. To start BigInsights, __1. Click on the Start BigInsights desktop icon. Wait until the operation completes. (This may take a few moments, depending on your available machine resources.) __2. Launch the Web console. (Click on the desktop icon or type the appropriate URL into your Web browser, such as http://your_hostname:8080.) __3. Click on the Cluster Status tab and verify that all services except Monitoring have been started. Depending on your machine resources, you may need to wait for a few refresh cycles for the display to fully reflect the current status of your cluster. NOTE: This icon is provided as a shortcut on the Quick Start Edition VMware image. Clicking on it simply invokes the script found in the $BIGINSIGHTS_HOME/bin directory. Alternatively, you can open a terminal window, navigate to this directory, and enter ./ on the command line.

IBM Software Page 6 Introduction to Big SQL 2.2 Launch Eclipse Most of the remaining exercises use the BigInsights tooling for Eclipse to run Big SQL queries, so you need to launch Eclipse and verify its set up now. __1. Click on the Eclipse desktop icon or launch Eclipse following your standard process. __2. When prompted to select a workspace, accept the default and wait until Eclipse opens. Verify that the Task Launcher for Big Data appears. If so, skip the remaining steps and continue to the next section.

IBM Software Hands On Lab Page 7 __3. If the Task Launcher does not appear, open the BigInsights perspective. From the Eclipse menu items at top, select Window > Perspective > BigInsights. (If necessary, click Window > Perspective > Other > BigInsights.) __4. If BigInsights isn't available through the Perspectives menu, the plug-in probably hasn't been installed. You can verify the installed software through Help > About Eclipse SDK -> Installation Details. If the necessary plug-in has been properly installed, you'll see two entries for InfoSphere BigInsights. Close the pop-up windows when you're done.

IBM Software Page 8 Introduction to Big SQL __5. If the BigInsights plug-in is not present or is not properly installed, you will need to download and install it. The Welcome tab of the Web console contains a Quick Link with the necessary instructions.

IBM Software Hands On Lab Page 9 2.3 Create a BigInsights Server Connection in Eclipse Certain tasks require a live connection to a BigInsights cluster. This section describes how you can define a BigInsights server connection in Eclipse. If you're working with the Quick Start Edition VMware image, this section is optional. (The image is pre- configured with a BigInsights server connection.) __1. From the Overview tab of the Task Launcher for Big Data, click Create a BigInsights server connection. __2. Enter the appropriate information in the pop-up window, including the URL to access your BigInsights Web console, a server name of your choice, a valid BigInsights user ID, and a password. The information shown below contains information for the BigInsights Quick Start Edition VMware image (V2.1.0.1).

IBM Software Page 10 Introduction to Big SQL __3. Click the Test connection button and verify that you can successfully connect to your target cluster. __4. Click the Save password box and Finish. (If you are prompted to enter a Secure Password, enter biadmin and optionally provide two security questions and answers of your choice.) __5. In the BigInsights Server pane, expand the list of servers and verify that the server connection you created appears. 2.4 Create a Big SQL Connection in Eclipse Certain tasks require a live connection to a Big SQL server within the BigInsights cluster. This section explains how you can define a JDBC connection to your Big SQL server. If you're working with the Quick Start Edition VMware image, this section is optional. (The image is pre- configured with a Big SQL connection.) __1. Open the Database Development perspective. Window > Open Perspective > Other > Database Development. __2. In the Data Source Explorer pane, right click on Database Connections > Add Repository.

IBM Software Hands On Lab Page 11 __3. In the New Connection Profile menu, select Big SQL JDBC Driver and enter a name for your new driver (e.g., My Big SQL Connection). Click Next. __4. Enter the appropriate connection information for your environment, including the host name, port number (7052, by default), user ID, and password. Verify that you have selected the correct JDBC driver at top. The information shown below contains information for the free BigInsights Quick Start Edition VMware image (V2.1.0.1).

IBM Software Page 12 Introduction to Big SQL __5. Click the Test connection button and verify that you can successfully connect to your target Big SQL server. __6. Click the Save password box and Finish. __7. In the Data Source Explorer, expand the list of data sources and verify that your Big SQL connection appears. __8. Return to the BigInsights perspective. You're now ready to create and query Big SQL tables.

IBM Software Hands On Lab Page 13 Lab 3 Querying data with Big SQL In this lab you will begin working with Big SQL to query and analyze data in a data warehouse system for Hadoop. Big SQL provides broad SQL support based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights in the same way that you access relational databases from your enterprise applications. Multiple queries can be executed concurrently. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions. Depending on the nature of your query, your data volumes, and other factors, Big SQL can use Hadoop's MapReduce framework to process various query tasks in parallel or execute your query locally within the Big SQL server on a single node — whichever may be most appropriate for your query. This tutorial uses data from the fictional Sample Outdoor Company, which sells and distributes outdoor products to third-party retailer stores around the world. The company also sells directly to consumers through its online store. For the last several years, the company has steadily grown into a worldwide operation, selling their line of products to retailers in nearly every part of the world. You will write Big SQL queries and review their results to learn more about the products and sales of this firm. After you complete the lessons in this module, you will understand how to: • Create Big SQL tables that use Hive as the underlying storage manager. • Load sample data from local files into Big SQL tables. • Create and run Big SQL queries from Eclipse. • Optionally, create and run a JDBC client application for Big SQL using Eclipse. Allow 1 – 1.5 hours to complete this lab.

IBM Software Page 14 Introduction to Big SQL 3.1 Create a project and an SQL script file To begin, you will create a BigInsights project and Big SQL script in Eclipse. __1. Create a BigInsights project in Eclipse. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, select BigInsights Project, and then click Next. __2. Type myBigSQL in the Project name field, and then click Finish. __3. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective. __4. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, and select SQL Script, and then click Next. __5. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder. __6. In the File name field, type aFirstFile. The .sql file extension is added automatically. Click Finish.

IBM Software Hands On Lab Page 15 __7. In the Select Connection Profile window, select the Big SQL connection. The properties of the selected connection display in the Properties field. When you select the Big SQL connection, the Big SQL database-specific context assistant and syntax checks are activated in the editor that is used to edit your SQL file. __8. Click Finish to close the Select Connection Profile window. __9. In the SQL Editor that opens the aFirstFile.sql SQL file you created, add the following Big SQL comments: -- This is a simple SQL script. -- These are comments. -- A line that begins with 2 dashes is a comment line -- and is not part of the processed SQL statements. As the text indicates, lines that begin with two dashes are comments. It’s useful to comment your Big SQL code so that others can easily follow your logic. You will be using this file in later lessons. __10. Save aFirstFile.sql by using the keyboard short, CTRL-S.

IBM Software Page 16 Introduction to Big SQL 3.2 Create sample tables and load sample data In this lesson, you will create tables and load data from files in your local file system. The examples in this tutorial use Big SQL tables managed by Hive. The sample data you will use is located in the $BIGSQL_HOME/samples folder in the local file system of the InfoSphere BigInsights server. The $BIGSQL_HOME environment variable is set to the installed location, which is typically /opt/ibm/biginsights/bigsql/. The time range of the Sample Outdoor Company sample data is 3 years and 7 months, starting January 1, 2004 and ending July 31, 2007. The 43-month period reflects the history that is made available for analysis. The schema that is used in this tutorial is the GOSALESDW. It contains fact and dimension tables for the a variety of areas, such as: • Finance • Geography • Marketing • Personnel • Products • Sales • Time __1. Open a terminal window using a desktop icon or standard operating system facilities. __2. Change your directory to $BIGSQL_HOME/samples. Open and review the information in the README file. __3. Open a new terminal window and invoke the set up script provided: ./ -u biadmin -p biadmin

IBM Software Hands On Lab Page 17 __4. The script runs when you see this message: Loading the data .. will take a few minutes to complete .. Please check /var/ibm/biginsights/bigsql/temp/GOSALESDW_LOAD.OUT file for progress information. The script runs three files located in the $BIGSQL_HOME/samples/queries directory: GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and GOSALESDW_load.sql. __5. When the script completes, use the Files tab of the Web console to navigate to the directory containing your new tables: hdfs://biginsights/hive/warehouse/gosalesdw.db __6. Optionally, return to your terminal window or use your system’s file browser to navigate to the $BIGSQL_HOME/samples/queries subdirectory. Locate the GOSALESDW_ddl.sql file and display its contents.

IBM Software Page 18 Introduction to Big SQL Observe that the CREATE TABLE statement contains all the standard SQL clauses you would expect when working with a relational DBMS. Note that there are some additional clauses as well – clauses that are unique to Hadoop, including a format specification. Because Hive, the data warehouse platform for Hadoop, can store a variety of data formats, additional information is often required when creating tables. The sales order method table (shown above) will store data in a row format with fields delimited by a tab (t) character. __7. Optionally, locate the GOSALESDW_load.sql in the $BIGSQL_HOME/samples/queries directory or your local file system and display its contents. Note that data will be loaded from local files (in the ../samples/data subdirectory) into the tables created by the previous script. The overwrite clause in each LOAD statement causes any data that existed in the table to be overwritten by the contents of data in the file. __8. Optionally, navigate to the $BIGSQL_HOME/samples/data directory of your local file system and display the GOSALESDW.DIST_INVENTORY_FACT.txt file. Note that each record contains multiple fields separated by tabs. This matches the specification of the DDL statement for the Big SQL table.

IBM Software Hands On Lab Page 19 3.3 Running basic Big SQL queries You now have tables populated with data about the Sample Outdoor Company. In this lesson, you will explore some basic Big SQL queries and begin to understand the sample data. __1. From the Eclipse Project Explorer, open the myBigSQL project, and double-click the aFirstFile.sql file. __2. In the SQL editor pane, type the following statement: SELECT * FROM GOSALESDW.GO_REGION_DIM; Each complete SQL statement must end with a semi-colon. The statement selects, or fetches, all the rows that exist in the GO_REGION_DIM table. __3. Click the Run SQL icon (a green button with an arrow in the upper right corner of your file). Depending on how much data is in the table, a SELECT * statement might take some time to complete. Your result should contain 21 records or rows.

IBM Software Page 20 Introduction to Big SQL You might have a script that contains several queries. When you want to run the entire script, click the Run SQL icon. When you want to run a subset of statements, highlight the statement(s) that you want to run and press F5. __4. Refine the SELECT * statement by adding a predicate to the second statement to return fewer rows. SELECT * FROM GOSALESDW.GO_REGION_DIM WHERE REGION_EN LIKE 'Amer%'; __5. Click Run SQL to run the entire script. This query results in four records or rows. Tip: If you get an error, type out the full query instead of using copy/paste. __6. You can learn something more about the structure of the table with some queries to the Big SQL catalog tables, which provide metadata support to the database. The Big SQL catalog consists of four tables in the schema SYSCAT: schemas, tables, columns, and index columns. Type the following query, and this time, select the statement, and press F5. SELECT * FROM syscat.columns WHERE tablename= 'go_region_dim' AND schemaname= 'gosalesdw';

IBM Software Hands On Lab Page 21 This query uses two predicates in a WHERE clause. The query finds all of the information from the syscat.columns table as long as the tablename column value is go_region_dim and the schemaname column value is gosalesdw. Since you are using AND, both predicates must be true to return a row. Use single quotation marks around string values. The result of the query to the syscat.columns table is the metadata, or the structure of the table. Look at the SQL Results tab in Eclipse to see your output. The SQL Results tab in Eclipse shows 54 rows as your output. That means that there are 54 columns in the table go_region_dim. __7. Next, run a query that returns the number of rows in a table. Type the following query, select the statement, and then press F5. SELECT COUNT(*) FROM gosalesdw.go_region_dim; The COUNT aggregate function returns the number of rows in the query result set. A row that includes only null values is included in the count. In this example, there are 21 rows in the go_region_dim table. __8. Another form of the count function is the COUNT (distinct <expression>) statement. As the name implies, you can determine the number of unique values in a column, such as region_en. Type this statement in your SQL file: SELECT COUNT (distinct region_en) FROM gosalesdw.go_region_dim;

IBM Software Page 22 Introduction to Big SQL The result is 5. This result means that there are five unique region names in English (the column name region_en). __9. Another useful statement in Big SQL is the LIMIT clause. The LIMIT clause specifies a limit on the number of output rows that are produced by the SELECT statement. Type this statement in your SQL file, select the statement, and then press F5: SELECT * FROM GOSALESDW.DIST_INVENTORY_FACT LIMIT 50; __10. Save your SQL file. 3.4 Analyzing the data with Big SQL Now that you've created some tables and issued some basic Big SQL queries, let's explore a few scenarios that are a bit more sophisticated. In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations. Note that the queries included in this section are based on sample queries shipped with BigInsights in the $BIGSQL_HOME/samples/queries directory. __1. Create a new SQL file named companyInfo.sql in your project. From the Eclipse menu, click File > New > Other. In the Select a wizard window, expand the BigInsights folder, select SQL Script from the list of wizards, and then click Next. In the New SQL File window, select the myBigSQL project folder in the Enter or select the parent folder field. In the File name field, type companyInfo. The .sql file extension is added automatically. Click Finish. To learn what products were ordered from the Sample Outdoor Company, and by what method they were ordered, you must join information from multiple tables. __2. Type the following comments and statement into the companyInfo.sql file: -- Fetch the product name and the quantity and -- the order method. -- Query 1 SELECT pnumb.product_name, sales.quantity,

IBM Software Hands On Lab Page 23 meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key; Because there is more than one table reference in the FROM clause, the query can join rows from those tables. A join predicate specifies a relationship between at least one column from each table to be joined. • The predicates such as prod.product_number=pnumb.product_number helps to narrow the results to product numbers that match in two tables. • This query also uses an alias in the SELECT and FROM clauses, such as pnumb.product_name. pnumb is the alias for the gosalesdw.sls_product_lookup table. That alias can now be used in the where clause so that you do not need to repeat the complete table name, and the WHERE clause is not ambiguous. • The use of the predicate and pnumb.product_language=’EN’ helps to further narrow the result to only English output. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization. __3. Highlight the statement, beginning with the keyword SELECT and ending with the semi-colon, and press F5. Review the results in the SQL Results page. You can now begin to see what products are sold and how they are ordered by customers.

IBM Software Page 24 Introduction to Big SQL __4. By default, the Eclipse SQL Results page limits the output to only 500 rows. You can change that value in the Data Management preferences. However, to find out how many rows the query actually returns in a full Big SQL environment, type the following query into the companyInfo.sql file, then select the query, and then press F5: --Query 2 SELECT COUNT(*) --(SELECT pnumb.product_name, sales.quantity, -- meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key;

IBM Software Hands On Lab Page 25 The result for the query is 446,023 rows. __5. Update the query that is labeled --Query 1 to restrict the order method to equal only 'Sales visit'. Add the following string just before the semi-colon: AND order_method_en='Sales visit' __6. Select the entire --Query 1 statement, and press F5. The result of the query displays in the SQL Results page in Eclipse. The results now show the product and the quantity that is ordered by customers actually visiting a retail shop. __7. To find out which method of all the methods has the greatest quantity of orders, you must add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). You will also use a SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. In addition, you can clean up the output a bit by using an alias (as Product) to substitute a more readable column header. -- Query 3

IBM Software Page 26 Introduction to Big SQL SELECT pll.product_line_en AS Product, md.order_method_en AS Order_method, sum(sf.QUANTITY) AS total FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.sls_product_line_lookup AS pll, gosalesdw.sls_product_brand_lookup AS pbl, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND md.order_method_key = sf.order_method_key AND pll.product_line_code = pd.product_line_code AND pbl.product_brand_code = pd.product_brand_code GROUP BY pll.product_line_en, md.order_method_en; __8. Select the complete statement, and press F5. Your results in the SQL Results page show 35 rows. 3.5 Optional: Using Big SQL from a JDBC client application You can write a JDBC client application that uses Big SQL to open a database connection, execute queries, and process the results. In this optional exercise, you'll see how writing a client JDBC application for Big SQL is like writing a client application for any relational DBMS that supports JDBC access. __1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File > New >Project. From the New Project window, select Java Project. Click Next.

IBM Software Hands On Lab Page 27 __2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next. __3. Open the Libraries tab and click Add External Jars. Select the Big SQL JDBC driver (bigsql- jdbc-driver.jar) from your local path $BIGSQL_HOME/jdbc. __4. Click Finish. Click No when you are asked if you want to open a different perspective. __5. Right-click the MyJavaProject project, and click New > Package. In the Name field, in the New Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.

IBM Software Page 28 Introduction to Big SQL __6. Right-click the aJavaPackage4me package, and click New > Class. __7. In the New Java Class window, in the Name field, type SampApp. Select the public static void main(String[] args) check box. Click Finish. __8. Copy or type the following code into the file: package aJavaPackage4me; //a. Import required package(s) import java.sql.*;

IBM Software Hands On Lab Page 29 public class SampApp { /** * @param args */ //b. set JDBC & database info static final String db = "jdbc:bigsql://bivm:7052/default"; static final String user = "biadmin"; static final String pwd = "biadmin"; public static void main(String[] args) { Connection conn = null; Statement stmt = null; System.out.println("Started sample JDBC application."); try{ //c. Register JDBC driver Class.forName(""); //d. Get a connection conn = DriverManager.getConnection(db, user, pwd); System.out.println("Connected to the database."); //e. Execute a query

IBM Software Page 30 Introduction to Big SQL stmt = conn.createStatement(); System.out.println("Created a statement."); String sql; sql = "select * from gosalesdw.sls_product_dim " + "where product_key=30001"; ResultSet rs = stmt.executeQuery(sql); System.out.println("Executed a query."); //f. Obtain results System.out.println("Result set: "); while({ //Retrieve by column name int product_key = rs.getInt("product_key"); int product_number = rs.getInt("product_number"); //Display values System.out.print("* Product Key: " + product_key + "n"); System.out.print("* Product Number: " + product_number + "n"); } //g. Close open resources rs.close(); stmt.close(); conn.close(); }catch(SQLException sqlE){ // Process SQL errors sqlE.printStackTrace();

IBM Software Hands On Lab Page 31 }catch(Exception e){ // Process other errors e.printStackTrace(); }finally{ // Ensure resources are closed before exiting try{ if(stmt!=null) stmt.close(); }catch(SQLException sqle2){ } // nothing we can do try{ if(conn!=null) conn.close(); }catch(SQLException sqlE){ sqlE.printStackTrace(); }// end finally block }// end try block System.out.println("Application complete"); }} __a. After the package declaration, ensure that you include the packages that contain the JDBC classes that are needed for database programming. __b. Set up the database information so that you can refer to it. __c. Then, register the JDBC driver so that you can open a communications channel with the database. __d. Open the connection. __e. Run a query by submitting an SQL statement to the database. __f. Extract data from result set. __g. Clean up the environment by closing all of the database resources. __9. Save the file and right-click the Java file and click Run > Run as > Java Application.

IBM Software Page 32 Introduction to Big SQL __10. The results show in the Console view of Eclipse: Started sample JDBC application. Connected to the database. Created a statement. Executed a query. Result set: * Product Key: 30001 * Product Number: 1110

IBM Software Hands On Lab Page 33 Lab 4 Optional: Working with SerDes While data structured in CSV and TSV columns are often stored in BigInsights and loaded into Big SQL table, you may also need to work with other types of data – data that might require the use of a serializer / deserializer (SerDe). SerDes are common in the Hive and HBase communities. You’ll find a number of SerDes available in the public domain, or you can write your own following typical Hadoop practices. Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need, just add its JAR file to the /userlib subdirectory of your Big SQL home directory. Then stop and restart the Big SQL service, and specify the SerDe class name when you create your table. (Note: If needed, look in your JAR file to determine the class name of the SerDe you’ll be using. The CREATE TABLE statement requires the class name, not the JAR file name.) In this lab exercise, you will use a SerDe to define a Hive-managed table for blog data collected in a JSON (JavaScript Object Notation) format. JSON files have a nested, varied structure defined by the user or application that created them. The JSON-based blog file was collected by a BigInsights sample application that collects social media data from various public Web sites. The sample data is available for free download as part of a developerWorks article on Analyzing Social Media and Structured Data with InfoSphere Biginsights. Before beginning this lab, be sure that you have a copy of the blogs-data.txt file stored in your local file system. After you complete the lessons in this module, you will understand how to: • Register a SerDe with Big SQL and Hive • Create a Big SQL table that uses a SerDe for processing JSON data • Populate a Big SQL table with JSON data • Query this Big SQL table Allow 15 – 30 minutes to complete this lab. 4.1 Register a SerDe with Big SQL and Hive In this exercise, you will provide a JSON-based SerDe to Big SQL and Hive so that you can later create a table that relies on this SerDe. Note: If your instructor has provided you with a USB drive containing the hive-json-serde-0.2.jar and the blogs-data.txt files, you can skip the first two steps. __1. Download the .zip file containing the sample data from the bottom half of the article referenced in the introduction. Unzip the file into a directory on your local file system, such as /home/biadmin/sampleData/Watson. You will be working with the blogs-data.txt file. __2. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system, such as /home/biadmin/sampleData. __3. Ask your instructor if the SerDe file is installed on your VM. If not, complete the following sub- steps: __a. Stop the Big SQL server. (If you'd like, you can use the Cluster Status tab of the BigInsights Web console to do this.)

IBM Software Page 34 Introduction to Big SQL __b. Copy the SerDe .jar file to $BIGSQL_HOME/userlib and $HIVE_HOME/lib directories. __c. Restart the Big SQL server. (You can use the Cluster Status tab of the Web console.) 4.2 Creating, populating, and querying a table that uses a SerDe Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that relies on the SerDe you just registered. Then you’ll load some sample JSON data in to the table and query its contents. You’ll see that only the first task – creating a table – requires you to be aware of the SerDe. __4. If desired, create a new BigInsights project in Eclipse for your work. (You can also use an existing project if you’d prefer.) __5. Create a new SQL script within your project named serdeTest.sql. __6. Copy and paste the following CREATE TABLE statement into your script: create table if not exists testblogs ( Country String, Crawled String, FeedInfo String, Inserted String, IsAdult int, Language String, Postsize int, Published String, SubjectHtml String, Tags String, Type String, Url String) row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' stored as textfile;

IBM Software Hands On Lab Page 35 __7. Run the statement. (E.g., highlight it and click F5.) __8. Verify that the statement executed successfully. Inspect the results in your Eclipse console or use the Files tab of the Web console to navigate to the location of your Hive database. (By default, this is /biginsights/hive/warehouse.) __9. Next, load the blogs data into your table. Copy and paste the following statement into your SQL script, adjusting the local file path specification as needed to match the location where you placed the sample blogs-data.txt file: load hive data local inpath '/home/biadmin/sampleData/Watson/blogs-data.txt' overwrite into table testblogs; Note that this statement loads data from the file in your local file system into the testblogs table you created earlier. If that table had already contained data, this command would overwrite its contents with data found in the specified file. As you can see, there’s nothing particularly special about this LOAD statement – i.e., you don’t need to reference the SerDe in any way. __10. Run the statement. (E.g., highlight it and click F5.) __11. Verify that the statement executed successfully. Inspect the results in your Eclipse console or use the Files tab of the Web console to expand the testblogs folder in your Hive database. (By default, you will find this in /biginsights/hive/warehouse/testblogs.) You should find a copy of you input file there. About this code . . . . The CREATE TABLE statement specifies the class in the SerDe .jar file that is responsible for processing the input record into a "row" that Hive (and Big SQL) can understand. Because you copied the SerDe .jar file into the appropriate Big SQL and Hive directories earlier, the runtime engine will be able to locate this class in the .jar file and successfully execute the CREATE TABLE statement. Quite commonly, new users will specify the .jar file in the CREATE TABLE statement instead of the class file. Doing so will result in a runtime error.

IBM Software Page 36 Introduction to Big SQL __12. Add a simple SELECT statement to your SQL script to query the table. For example, the following statement will retrieve 5 rows from the table: select * from testblogs limit 5; Note that the SELECT syntax does not reference the SerDe in any way. __13. Run the statement (e.g., highlight it and click F5) and inspect the results.

IBM Software Hands On Lab Page 37

IBM Software Page 38 Introduction to Big SQL Lab 5 Summary Congratulations! You’ve just learned the basics of using SQL to work with data managed by Hadoop. To expand your skills and learn more, enroll in free online courses offered by Big Data University or work through free tutorials included in the BigInsights product documentation. A public wiki contains links to these and other resources, including free software downloads, a public forum, and a collection of videos and demos.



© Copyright IBM Corporation 2014. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo and are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at

Add a comment

Related presentations

Related pages

Introducing the Microsoft Analytics Platform System ...

... the turnkey appliance for big data ... (13) March 2014 (15 ... seamlessly unifying the data in SQL Server PDW with data in Hadoop is ...
Read more

Get big results from big data. - Apache Hadoop ...

MapR is the Leader in Apache Hadoop Technology for Big Data Deployments. ... Solution Areas; ... Why MapR; Why Hadoop; SQL on Hadoop;
Read more

Apache Spark™ - Lightning-Fast Cluster Computing

Apache Spark is a fast and general engine for big data ... In-person events include the Bay Area Spark meetup and ... Spark Summit 2014 contained ...
Read more

PoweredBy - Hadoop Wiki - Apache Software Foundation

Benipal Technologies - Big Data. ... We are using Hadoop in our data ... Apache Pig and map/reduce to process extracted SQL data to ...
Read more

Impala - The Platform for Big Data and the Leading ...

Bay Area Impala User Group ... scalability strengths of Hadoop - combining the familiar SQL support and multi-user ... analytics across all data in Hadoop ...
Read more

Big Data - YouTube

... 13. Play next; ... Analyzing Big Data with Twitter ... Doug Cutting Washington DC Hadoop Meetup by Washington DC Hadoop Users Group. 1:27:28.
Read more

Cloudera Training

Cloudera University is the leading provider of Apache Hadoop education, with classroom training in more than 50 cities around the world and online
Read more

Microsoft ships CTP of Hadoop Connectors for SQL Server ...

... (13) March 2014 (15 ... structured and unstructured data. Hadoop Connector for SQL Server ... of complex unstructured data i.e. ‘Big Data’ will ...
Read more