From the course: SQL for Data Analysis

Using data types and identifying the wrong data types - SQL Tutorial

- Now, let's take a look at what data types are and how we can prevent storing and processing the wrong data. Let's review. A relational database is an organization of related tables. A table consists of rows and columns, with each row representing a record or an instance. A column represents an attribute of each instance. However, not all fields are made the same. The data type of a column defines what values the column can hold. Much like in programming languages, each attribute or column has a data type that helps the system and the programmer know how to interpret the value and how to process it.

Here's an example. Let's say I want to order a pet carrier. Depending on the type of pet I have, like a dog, cat or bird, and the size of my pet, I'm going to buy the carrier that best fits. Also, depending on their needs, I might need different features. Data types work the same way. They help us store the right value to make sure the application works correctly.

There are many different data types to choose from, and depending on the system, data types and their corresponding functions can behave differently. Some common data types in SQL are INT or INTEGER, VARCHAR (you may also see NVARCHAR as a derivative of that), DATE, DATETIME, FLOAT, DECIMAL and DOUBLE. Also note that different systems offer different data types.

When defining a column's data type, for some systems, you can include the maximum number of characters or digits allowed in the column. These restrictions are known as constraints. This is defined when the table is created. The developer or administrator can also define additional column constraints. When defining tables, we can define whether the data can be NULL or NOT NULL in each column. We can also ensure that all values in the column are UNIQUE. This is also where we define PRIMARY KEYs and FOREIGN KEYs and set DEFAULT values for each column.

Here's an example of a CREATE TABLE statement. Don't worry about the syntax for right now.
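The statement shown on screen is not captured in this transcript. As a stand-in, here is a minimal sketch of a CREATE TABLE statement that uses the data types and constraints just described, wrapped in Python's built-in sqlite3 module so it can be run as-is. The column list, the sizes, the 'CA' default and the sample rows are assumptions for illustration, not the course's actual schema. SQLite also stands in for the course's database here, and it accepts a declared length such as VARCHAR(50) without enforcing it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FOREIGN KEYs only when this is on

# Hypothetical schema: names, sizes and defaults are assumptions for illustration
conn.executescript("""
CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY,           -- unique identifier for each row
    FirstName  VARCHAR(50)  NOT NULL,         -- a value is required
    LastName   VARCHAR(50)  NOT NULL,
    Email      VARCHAR(100) UNIQUE,           -- no two rows may share an email
    State      CHAR(2)      DEFAULT 'CA'      -- used when no value is supplied
);
CREATE TABLE Orders (
    OrderID    INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID),  -- FOREIGN KEY
    TotalDue   DECIMAL(10, 2) NOT NULL DEFAULT 0
);
""")

conn.execute(
    "INSERT INTO Customer (FirstName, LastName, Email) "
    "VALUES ('Ana', 'Lopez', 'ana@example.com')"
)


def rejected(sql):
    """Return True if the database refuses the statement because of a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Each statement below breaks exactly one constraint
missing_name = rejected("INSERT INTO Customer (LastName) VALUES ('Smith')")
duplicate_email = rejected(
    "INSERT INTO Customer (FirstName, LastName, Email) "
    "VALUES ('Bo', 'Chen', 'ana@example.com')"
)
orphan_order = rejected("INSERT INTO Orders (CustomerID, TotalDue) VALUES (999, 10)")

# The first customer was inserted without a State, so the DEFAULT applies
default_state = conn.execute(
    "SELECT State FROM Customer WHERE CustomerID = 1"
).fetchone()[0]

print(missing_name, duplicate_email, orphan_order, default_state)
```

In this sketch the NOT NULL, UNIQUE and FOREIGN KEY constraints each turn away one bad row, which is the kind of protection the constraints above are meant to provide.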
We'll go over that in more detail later. Let's look at the table's data types and constraints. Here we have the CustomerID column. You can specify an integer of size 4 as the data type for CustomerID. That'll be good for about 9,999 customer IDs. If the system tries to auto-increment past 9,999, or a customer ID has more than four digits, the process will error. It's important to work with the database team to ensure there's enough space for customer IDs that are larger than four digits, using a data type such as INT or SMALLINT. For example, in MySQL, the size options for integer values include TINYINT, SMALLINT, MEDIUMINT and BIGINT. Data types and constraints ensure that the data is kept consistent and has good quality.

Let's look at some more examples. I want to calculate the total amount a customer has ever spent with H+ Sport. I've typed out a query in advance so we can go over the main concepts you need to know for now. If there are some keywords you don't recognize, it's okay. I've selected the CustomerID, FirstName and LastName from the Customer table, and the TotalDue from the Orders table. And the results from that query give us the total due for each customer. However, that's not exactly the question we want to answer. We need to apply a function called SUM to add up the total due for each customer ID. We can do this using the SUM function together with the GROUP BY clause. This allows us to aggregate the total due for each customer ID. Again, we'll cover aggregates in a minute. It's a great feature, and I won't forget to come back to this. Let's click on Run on active connection, and here are our query results.

Be sure that when using the SUM function, you're working with a numeric data type. Let's try the same query with a text data type like customer email. Looks like we didn't encounter an error, but let's look at the results. We have zeros as the value for the summed email. Summing the email's text data type will not error, but it also will not give us correct results.
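The query itself is not reproduced in the transcript, so the following is only a rough sketch of the two runs just described: SUM with GROUP BY on a numeric column, then the same aggregate on a text column. The table contents are made up, the aliases TotalSpent, EmailSum and EmailCount are assumptions, and SQLite (through Python's sqlite3 module) stands in for the course's database. Like the system in the demo, SQLite returns zeros for a summed text column rather than raising an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up sample data; the column names follow the ones mentioned in the transcript
conn.executescript("""
CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Email TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, TotalDue DECIMAL(10, 2));
INSERT INTO Customer VALUES (1, 'Ana', 'Lopez', 'ana@example.com'),
                            (2, 'Bo', 'Chen', 'bo@example.com');
INSERT INTO Orders VALUES (1, 1, 20.00), (2, 1, 5.50), (3, 2, 12.00);
""")

# SUM on a numeric column: one total per customer
totals = conn.execute("""
    SELECT c.CustomerID, c.FirstName, c.LastName, SUM(o.TotalDue) AS TotalSpent
    FROM Customer AS c
    JOIN Orders AS o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.FirstName, c.LastName
    ORDER BY c.CustomerID
""").fetchall()

# SUM on a text column: no error here, but the result is meaningless.
# COUNT is one of the functions that does make sense for text values.
email_sums = conn.execute("""
    SELECT c.CustomerID, SUM(c.Email) AS EmailSum, COUNT(c.Email) AS EmailCount
    FROM Customer AS c
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
""").fetchall()

for row in totals:
    print(row)
for row in email_sums:
    print(row)
```

The second query runs without complaint, so nothing but a look at the results (or at the column's data type) reveals that the sum is wrong.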
In other database management systems, such as SQL Server, this would result in an error. The character data type does not allow for the use of the SUM function. There are other functions we can use to count the number of emails in this case.

I'll click on Top10Customers, that SQL file. I'm going to close this query result. I'll add some additional code to select the top 10 customers by adding the LIMIT 10 and ORDER BY keywords. Scroll over a little bit to see the results. It works. We have our top 10 customers ordered by total due.

Also note that the IDs we use to join the data together should be of the same data type. For example, you can store customer ID 001 as VARCHAR or INT. Whatever you decide, the data type needs to be consistent across tables to prevent any future errors that might occur. Take a look at a few of the links for data types for MySQL and SQL Server for more information on the data types that best fit the data you are working with.
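The contents of the Top10Customers file are not shown in the transcript. A minimal sketch of what such a query might look like, assuming the same Customer and Orders columns named earlier, is below. The twelve customers and their order amounts are invented for illustration, the alias TotalSpent is an assumption, and SQLite again stands in for the course's database. Note that CustomerID is declared as an integer in both tables, in line with the advice about consistent types for join keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, TotalDue DECIMAL(10, 2));
""")

# Twelve invented customers; customer n places two orders of n * 10 each
for n in range(1, 13):
    conn.execute("INSERT INTO Customer VALUES (?, ?, ?)", (n, f"First{n}", f"Last{n}"))
    conn.executemany(
        "INSERT INTO Orders (CustomerID, TotalDue) VALUES (?, ?)",
        [(n, n * 10), (n, n * 10)],
    )

# ORDER BY sorts the aggregated totals from highest to lowest,
# and LIMIT 10 keeps only the first ten rows of that ordering
top10 = conn.execute("""
    SELECT c.CustomerID, c.FirstName, c.LastName, SUM(o.TotalDue) AS TotalSpent
    FROM Customer AS c
    JOIN Orders AS o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.FirstName, c.LastName
    ORDER BY TotalSpent DESC
    LIMIT 10
""").fetchall()

for row in top10:
    print(row)
```

With twelve customers in the sample data, the two lowest spenders drop out and the remaining ten come back sorted by total due.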