From the course: Exploring Data Science with .NET using Polyglot Notebooks & ML.NET

Introducing DataFrames

- [Instructor] Let's take a look at DataFrames in Polyglot Notebooks. DataFrames are handy objects that let you work with tabular data sources, such as comma-separated values (CSV) files or tab-delimited files. DataFrames can even work with the results of SQL queries against relational databases. We've already gone ahead and imported the Microsoft.Data.Analysis NuGet package, which we need in order to work with DataFrame objects.

Next, we're going to load a customer CSV file that I have locally on disk, and we're going to use the DataFrame class to do that. So here, I'm calling DataFrame.LoadCsv. I'm telling it the file I want to read, in this case the customer CSV file. I'm telling it what separator to use, in this case a comma. That's the default value, but here I'm specifying it explicitly. If I wanted to work with tab-delimited files, I could use \t to represent a tab. If I wanted to work with pipe-delimited files, I could replace the comma with a pipe. The header parameter tells the DataFrame that the first row contains the names of each column, so don't treat that row as actual data; treat it as the source of the column names.

The guessRows parameter is optional and defaults to 10. It controls how many rows the DataFrame looks at to work out what type of data is in each column. You won't usually need to set guessRows, but if you have a lot of null values at the beginning of your file, you might see exceptions while loading because it can't determine the type of data in a particular column. Increasing guessRows can help. Once you have your parameters set, you can go ahead and run this, and we can see what's in our DataFrame. Here we see it loaded 5,130 rows into our DataFrame.
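The load step described above can be sketched roughly as the following notebook cell. This is a minimal sketch, not the cell from the video: the file name, column names, and sample rows are hypothetical stand-ins written to a temp folder so the cell runs on its own. The separator, header, and guessRows arguments are the LoadCsv parameters discussed in the walkthrough.

```csharp
// Polyglot Notebooks cell (C#). The #r directive pulls the package from NuGet.
#r "nuget: Microsoft.Data.Analysis"

using System;
using System.IO;
using Microsoft.Data.Analysis;

// Hypothetical stand-in for the customer file: a tiny CSV written to a temp path.
string path = Path.Combine(Path.GetTempPath(), "customers_sample.csv");
File.WriteAllLines(path, new[]
{
    "CustomerName,JobTitle,OrderNumber",   // header row: column names, not data
    "Ada,Engineer,1001",
    "Grace,Analyst,1002",
    "Linus,Manager,1003",
});

DataFrame df = DataFrame.LoadCsv(
    path,
    separator: ',',   // use '\t' for tab-delimited or '|' for pipe-delimited files
    header: true,     // treat the first row as column names
    guessRows: 10);   // rows inspected to infer each column's type (10 is the default)

// Report how much data was loaded.
Console.WriteLine($"{df.Rows.Count} rows, {df.Columns.Count} columns");
```

If a real file starts with many empty values in some column, raising guessRows to a larger number is the adjustment the walkthrough suggests trying.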
The DataFrame contains a number of different columns, including information about employees, job titles, order numbers, customers, product categories, and more. We can scroll through this, but we can clearly see this is a lot of data, and that can be a little hard to work through. Thankfully, we have some handy controls at the bottom that let us page through our data, and we can also scroll horizontally to see additional columns.

I find looking at the DataFrame this way to be helpful, but maybe not the best experience. A better experience is working directly with certain methods. For example, the Head method will help me see the first few rows of my DataFrame. So if I wanted to see the first row of my DataFrame, I'll say Head(1), and this shows me each of the columns as well as the first row in my DataFrame. Conversely, if I wanted to see the last few rows of my DataFrame, I could add a new cell and do that here. So I'm going to call Tail on the DataFrame and ask for the last three rows. And we can see these are the last three rows of my data file.

Now, sometimes the data has a particular order to it, and you don't necessarily want to see things at the beginning or at the end of your data. Sometimes you want to see things at random. For that, there's a Sample method. So if I call Sample on the DataFrame, give it the number of rows I want to see, and run this, I'm going to get a random sampling of rows from my DataFrame. So here, I see rows from anywhere in my DataFrame and I can see their values. This is really handy for getting a good sense of the data that's actually in my DataFrame. And you can run Sample a number of times and you'll generally get different rows each time. We're going to take a lot more looks at what's in the DataFrame and what you can do with it, but this is a good first taste of how we can work with data using our DataFrame.
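To make the Head, Tail, and Sample calls concrete, here is a small sketch that uses an in-memory DataFrame instead of the customer file. The column names and values are made up for illustration; only the three method calls come from the walkthrough.

```csharp
// Polyglot Notebooks cell (C#). The #r directive pulls the package from NuGet.
#r "nuget: Microsoft.Data.Analysis"

using System;
using Microsoft.Data.Analysis;

// Hypothetical six-row DataFrame standing in for the loaded customer data.
var orderNumbers = new Int32DataFrameColumn(
    "OrderNumber", new[] { 1001, 1002, 1003, 1004, 1005, 1006 });
var customers = new StringDataFrameColumn(
    "CustomerName", new[] { "Ada", "Grace", "Linus", "Alan", "Edsger", "Barbara" });
var df = new DataFrame(orderNumbers, customers);

DataFrame first = df.Head(1);        // the first row
DataFrame lastThree = df.Tail(3);    // the last three rows
DataFrame randomRows = df.Sample(3); // three rows picked at random

Console.WriteLine($"Head(1):   {first.Rows.Count} row");
Console.WriteLine($"Tail(3):   {lastThree.Rows.Count} rows");
Console.WriteLine($"Sample(3): {randomRows.Rows.Count} rows");
```

In a notebook, ending a cell with one of these variables on its own line (for example, randomRows) should render it as a table, which is usually more useful than printing the counts.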
