From the course: Complete Guide to Advanced SQL Server
Manipulate a DataFrame - SQL Server Tutorial
The easiest way to start working with a data set in Python is to first figure out the SELECT query that will retrieve the data that you want to work with from your database. I'm going to go ahead and take a look inside of the Wide World Importers database. And in the Tables folder, we'll see that there's a table called Application.Cities. I'm going to create a new query to pull some information out of this table. I'm going to pull out the distinct city name and state province ID values from it. And I also want to order the records by the city name values. I can then execute the query. I get the results down below. And this is the data that I want to work with in Python. Now, I can go ahead and select all three of these lines and cut them to my clipboard.
Now, we can start writing the Python script, and we're going to start with the same empty script template that we've been working with. We'll start with EXEC or execute sp_execute_external_script. Then we have our two parameters here. The first is @language, and that's going to be set to the value of Python. Next up, we have the @script parameter. And I'm going to leave that one blank for the time being. And now we're going to add one more parameter. So I'm going to type in a comma after the close of the script parameter, and we'll come down to line number six, and the next parameter is called @input_data_1. Remember to include underscores between the words "input," "data," and "1." I'm going to set its value. It's also a Unicode string. So I'm going to type in the capital letter N and open up a single quotation mark. And now I can paste in the contents of the SELECT statement that we just generated. I'll finish the parameter by closing it with a single quotation mark here at the very end. The @input_data_1 parameter is going to take the results of the SELECT statement, convert them into a pandas DataFrame, and store that DataFrame in a variable called InputDataSet.
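As a rough sketch of where this leaves us, the comment at the top of the block below shows the general shape of the T-SQL call described above, with the script body still empty. The Python underneath builds a small stand-in for InputDataSet so that the pandas steps can be tried outside of SQL Server. The three sample rows are made-up placeholders, not data from Application.Cities, and the exact layout of the T-SQL is an assumption based on the description.

```python
# Shape of the T-SQL call (this part runs in SQL Server, not in Python):
#
#   EXEC sp_execute_external_script
#       @language = N'Python',
#       @script = N'
#   # Python code goes here
#   ',
#       @input_data_1 = N'
#   SELECT DISTINCT CityName, StateProvinceID
#   FROM Application.Cities
#   ORDER BY CityName
#   ';

import pandas as pd

# Made-up placeholder rows standing in for the query results.
sample_rows = [
    ("Sample City A", 1),
    ("Sample City B", 2),
    ("Sample City C", 3),
]

# Inside SQL Server, InputDataSet is created for you from @input_data_1.
# Here we build a DataFrame with the same two column names by hand.
InputDataSet = pd.DataFrame(sample_rows, columns=["CityName", "StateProvinceID"])

print(type(InputDataSet))
```

Inside SQL Server you would not build the DataFrame yourself. The stand-in is only there so that the same InputDataSet name can be used when experimenting on your own machine.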
Now, we can use the data in the Python script by referring to the InputDataSet variable. Let's start by simply printing the contents of the DataFrame to the Messages window. I'll come back up here to line number four and I'm going to use the print function. And I want to print InputDataSet. The variable name doesn't have any spaces or underscores in it, and it uses a capital letter I, capital D, and a capital S. Let's execute the Python script and see what the results look like. That'll open up the Messages window. And I can see that I have a column for the CityName and a column for the StateProvinceID data. On the left are integers that are used to identify each row of the DataFrame. Notice that the first row is zero. Python uses zero-based indexing, and we can use these index numbers to pull individual rows from the DataFrame using pandas DataFrame properties.
One property is called iloc or I-L-O-C, which stands for integer location. You can use the iloc property by typing a period after the DataFrame name, which in our case is InputDataSet, type the period there and then iloc. After that, I need to open up two sets of square brackets, followed by the integer that represents the index position of the record that I'd like to return. How about number seven? We'll close the two square brackets and now we can execute this script. That returns just the single row from the DataFrame. We can retrieve multiple rows by separating the index integers with a comma. So I'll come back up here after seven, I'll type in a comma. How about also number 10 and number 15? We'll execute the script again. And I get those three records returned. We can also retrieve a range of rows by using only a single set of square brackets. I'm going to highlight all of this here and get rid of it. And this time I'll type in the range of 5 to 15. To specify the range, we can use a colon between those two integers. I'll execute the statement and we get the results here.
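The three iloc forms just described can be sketched as follows. This is a minimal example that assumes a stand-in DataFrame with 20 made-up rows in place of the real query results. The names single_row, several_rows, and row_range are hypothetical and are used here only for illustration.

```python
import pandas as pd

# Stand-in for InputDataSet: 20 made-up placeholder rows, not real data.
InputDataSet = pd.DataFrame({
    "CityName": [f"City {i:02d}" for i in range(20)],
    "StateProvinceID": [i % 5 + 1 for i in range(20)],
})

# Two sets of square brackets with one integer: a single row, by position.
single_row = InputDataSet.iloc[[7]]

# Two sets of square brackets with several integers: those rows only.
several_rows = InputDataSet.iloc[[7, 10, 15]]

# One set of square brackets with a colon: positions 5 through 14.
row_range = InputDataSet.iloc[5:15]

print(single_row)
print(several_rows)
print(row_range)
```

In the SQL Server script, each of these expressions would go inside the print call in the script body, one at a time.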
Notice that the end of the range is not inclusive. It starts at index number five but it does not include index 15. It stops at 14. This is similar behavior to what we saw with the for loop in the last chapter. Now, if you're just interested in seeing the first several rows from the DataFrame, you can use the head function instead of iloc. Go ahead and highlight all of this here. And instead of InputDataSet.iloc, I'm going to type in InputDataSet.head. Since this is a function, DataFrame dot head uses parentheses to define how many rows to retrieve from the beginning of the DataFrame. Let's retrieve the first 10 rows. I'll type in a 10 in parentheses. We'll close the parenthesis and we'll have that second closing parenthesis here for the print function. I'll execute the statement and I get the first 10 rows returned from the DataFrame.
For more general information about the DataFrame, you can use a property called columns to see just what the column labels are. Let's change this print statement to InputDataSet.columns. I'll execute it and I'll see that the two columns are CityName and StateProvinceID. And finally, you might also want to know how many rows and columns are in the DataFrame. You can get that information with the shape property. Now, I'll print InputDataSet.shape, and in the Messages window, we'll see that we have 37,940 rows and two columns. So there are a number of different ways that we can start working with the DataFrame in a Python script. Once you've loaded the InputDataSet variable from a SQL Server SELECT statement using the @input_data_1 parameter, you can then manipulate it with the properties and functions available in the pandas library.
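As a quick sketch of head, columns, and shape, the block below again uses a stand-in DataFrame with 20 made-up rows, so the shape it reports is for the placeholder data and not the 37,940 rows from the real table. The variable names first_ten, column_labels, and dimensions are hypothetical.

```python
import pandas as pd

# Stand-in for InputDataSet: 20 made-up placeholder rows, not real data.
InputDataSet = pd.DataFrame({
    "CityName": [f"City {i:02d}" for i in range(20)],
    "StateProvinceID": [i % 5 + 1 for i in range(20)],
})

# head is called with parentheses; the argument is the number of rows.
first_ten = InputDataSet.head(10)

# columns and shape are properties, so they take no parentheses.
column_labels = InputDataSet.columns
dimensions = InputDataSet.shape  # (number of rows, number of columns)

print(first_ten)
print(column_labels)
print(dimensions)
```

The difference to keep in mind is that head needs parentheses, while columns and shape do not.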