Polars is a "lightning-fast DataFrame library for Rust and Python" - if you're familiar with Pandas, Polars is quite similar but significantly faster and with a cleaner API. This post is a brief tour of how to set up and use Polars, working through a pragmatic example of the things you might do in a typical data exploration project.

The Setup

Install polars and jupyter:

python3 -m venv venv
source venv/bin/activate
pip install polars notebook

This sets up a python virtual environment, then installs the Polars library as well as jupyter. The page you're looking at was created as a jupyter notebook.
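
To double check the install worked (and to note which Polars version you ended up with - the API was changing quickly around the time this was written), a quick sanity check:

import polars as pl

# Prints the installed Polars version
print(pl.__version__)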

The Data

I'm using the New York Taxi Dataset parquet files in this notebook. I downloaded the October 2022 Yellow Taxi Trip Records (PARQUET) file - feel free to grab that as well.

The data is in parquet format, which is a much nicer format for data processing than CSV - it's columnar, compressed, and stores the column types along with the data.
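
One consequence of that is a reader can pull just the columns it needs instead of parsing whole rows. Here's a minimal sketch of what that looks like with Polars' lazy scan (an aside - the rest of the post just uses the plain read_parquet call):

import polars as pl

# Lazily scan the file - nothing is loaded yet; when we collect(),
# only the two columns we selected should be read from disk.
fares = (
    pl.scan_parquet("yellow_tripdata_2022-10.parquet")
    .select(["total_amount", "trip_distance"])
    .collect()
)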

Polars

Let's import polars and read the file:

import polars as pl
fname = "yellow_tripdata_2022-10.parquet"
data = pl.read_parquet(fname)

Let's see what it looks like:

data.head(5)
shape: (5, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
1 2022-10-01 00:03:41 2022-10-01 00:18:39 1.0 1.7 1.0 "N" 249 107 1 9.5 3.0 0.5 2.65 0.0 0.3 15.95 2.5 0.0
2 2022-10-01 00:14:30 2022-10-01 00:19:48 2.0 0.72 1.0 "N" 151 238 2 5.5 0.5 0.5 0.0 0.0 0.3 9.3 2.5 0.0
2 2022-10-01 00:27:13 2022-10-01 00:37:41 1.0 1.74 1.0 "N" 238 166 1 9.0 0.5 0.5 2.06 0.0 0.3 12.36 0.0 0.0
1 2022-10-01 00:32:53 2022-10-01 00:38:55 0.0 1.3 1.0 "N" 142 239 1 6.5 3.0 0.5 2.05 0.0 0.3 12.35 2.5 0.0
1 2022-10-01 00:44:55 2022-10-01 00:50:21 0.0 1.0 1.0 "N" 238 166 1 6.0 0.5 0.5 1.8 0.0 0.3 9.1 0.0 0.0

What are the columns in this file?

data.columns
['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'store_and_fwd_flag',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'airport_fee']

Pretty easy to guess what the columns mean.
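
If you also want the column types at a glance, the schema is right there:

data.schema  # mapping of column name -> dtype, e.g. VendorID -> Int64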

Let's poke around a little bit - what's the average taxi fare in NYC?

data["total_amount"].mean()
22.247540895418666

I wonder how many passengers there are in each ride?

data["passenger_count"].mean()
1.3846936672698376

How about the average fare from the airport? I noticed the airport_fee column, so let's look at the average total_amount when the airport fee is greater than zero:

data.filter( pl.col('airport_fee') > 0 )["total_amount"].mean()
55.83249747837022

How about the average non-airport ride cost?

data.filter( pl.col('airport_fee') == 0 )["total_amount"].mean()
18.957161390687517
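
As an aside, we could get both of those averages in one pass by grouping on an airport flag - a quick sketch (the groupby / agg pattern shows up again later in the post):

(
    data
    .with_columns([(pl.col("airport_fee") > 0).alias("is_airport")])
    .groupby("is_airport")
    .agg([pl.col("total_amount").mean().alias("avg_total_amount")])
)
# Note: rows with a null airport_fee end up in their own null group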

Most expensive rides

Let's see what the most expensive fares in this data set are. We'll sort the data by total_amount and look at the top 10.

most_expensive = data\
    .sort("total_amount", reverse=True)\
    .head(10)\
    .select(["total_amount", "PULocationID", "DOLocationID"])

most_expensive
shape: (10, 3)
total_amount PULocationID DOLocationID
f64 i64 i64
1301.85 39 265
950.3 226 226
950.3 260 260
870.6 265 265
800.3 140 229
750.3 265 265
750.3 265 265
639.8 68 193
636.55 138 138
606.36 4 64

I'm selecting just the 3 columns I'm interested in so it's easier to look at the data - the amount and pick up and drop off locations.

Note the coding style above: I split the query into separate lines so it's easier to read. I've become a stickler when it comes to query / code style - you'll thank yourself for keeping it clean when you come back months later and try to figure out what you were doing.

We can make it even cleaner if we use the parentheses syntax:

most_expensive = (
    data
    .sort("total_amount", reverse=True)
    .head(10)
    .select(["total_amount", "PULocationID", "DOLocationID"])
)

I wasn't crazy about the above syntax at first, but once I started using it I really liked it.

Also notice the style of computation here - we're chaining the actions one after another, JavaScript style: sort, then take 10 rows, then select the columns we want. This I also like.

Looking at the data, apparently somebody paid \$1,301.85 for a single ride. I wonder where they went to? Let's find out.

PULocationID and DOLocationID are the pick up and drop off location IDs. They're just numbers right now, not too informative. Let's convert them to names.

Zone Names

We need a mapping from the Zone IDs to the Zone names. From the NYC TLC Data page, look for the Taxi Zone Lookup Table (CSV) link and download the CSV file.

zone_data = pl.read_csv("taxi+_zone_lookup.csv")
zone_data.head(10)
shape: (10, 4)
LocationID Borough Zone service_zone
i64 str str str
1 "EWR" "Newark Airport... "EWR"
2 "Queens" "Jamaica Bay" "Boro Zone"
3 "Bronx" "Allerton/Pelha... "Boro Zone"
4 "Manhattan" "Alphabet City" "Yellow Zone"
5 "Staten Island" "Arden Heights" "Boro Zone"
6 "Staten Island" "Arrochar/Fort ... "Boro Zone"
7 "Queens" "Astoria" "Boro Zone"
8 "Queens" "Astoria Park" "Boro Zone"
9 "Queens" "Auburndale" "Boro Zone"
10 "Queens" "Baisley Park" "Boro Zone"

Ok, that's helpful, but I'd rather have the Zone and Borough together in a single column so it has all the info I need.

zone_data = (
    zone_data
    .with_column(
        (pl.col("Borough") + ": " + pl.col("Zone")).alias("Location"),
    )
    .select(["LocationID", "Location"])
)

zone_data.head(10)
shape: (10, 2)
LocationID Location
i64 str
1 "EWR: Newark Ai...
2 "Queens: Jamaic...
3 "Bronx: Allerto...
4 "Manhattan: Alp...
5 "Staten Island:...
6 "Staten Island:...
7 "Queens: Astori...
8 "Queens: Astori...
9 "Queens: Auburn...
10 "Queens: Baisle...

Ok, that's looking pretty good. Now let's look at those expensive rides again, but add in the pick up and drop off location names.

To do this I'll join the most_expensive dataframe I created above, which just has the fare and the pick up and drop off IDs, with the zone_data dataframe, which has the zone names. First I'll join PULocationID with the zone data to add the location name and rename the column from Location to PickUpLocation; then I'll do the same with DOLocationID.

The left join below means we keep all the rows from the left dataframe (the most_expensive dataframe) and inject into those rows the Location names from the zone_data dataframe, joining on LocationID.

most_expensive_rides = (
    most_expensive
    .join(zone_data, how="left", left_on="PULocationID", right_on="LocationID").rename({"Location": "PickUpLocation"})
    .join(zone_data, how="left", left_on="DOLocationID", right_on="LocationID").rename({"Location": "DropOffLocation"})
)
most_expensive_rides
shape: (10, 5)
total_amount PULocationID DOLocationID PickUpLocation DropOffLocation
f64 i64 i64 str str
1301.85 39 265 "Brooklyn: Cana... "Unknown: NA"
950.3 226 226 "Queens: Sunnys... "Queens: Sunnys...
950.3 260 260 "Queens: Woodsi... "Queens: Woodsi...
870.6 265 265 "Unknown: NA" "Unknown: NA"
800.3 140 229 "Manhattan: Len... "Manhattan: Sut...
750.3 265 265 "Unknown: NA" "Unknown: NA"
750.3 265 265 "Unknown: NA" "Unknown: NA"
639.8 68 193 "Manhattan: Eas... "Queens: Queens...
636.55 138 138 "Queens: LaGuar... "Queens: LaGuar...
606.36 4 64 "Manhattan: Alp... "Queens: Dougla...

The location names are getting cut off when we view the dataframe - let's configure the output to show wider columns.

pl.Config.set_fmt_str_lengths(100)
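
While we're at it, Polars has a few other display knobs along the same lines, for example:

pl.Config.set_tbl_rows(20)   # print up to 20 rows when displaying a dataframe
pl.Config.set_tbl_cols(25)   # print up to 25 columns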

most_expensive_rides
shape: (10, 5)
total_amount PULocationID DOLocationID PickUpLocation DropOffLocation
f64 i64 i64 str str
1301.85 39 265 "Brooklyn: Canarsie" "Unknown: NA"
950.3 226 226 "Queens: Sunnyside" "Queens: Sunnyside"
950.3 260 260 "Queens: Woodside" "Queens: Woodside"
870.6 265 265 "Unknown: NA" "Unknown: NA"
800.3 140 229 "Manhattan: Lenox Hill East" "Manhattan: Sutton Place/Turtle Bay North"
750.3 265 265 "Unknown: NA" "Unknown: NA"
750.3 265 265 "Unknown: NA" "Unknown: NA"
639.8 68 193 "Manhattan: East Chelsea" "Queens: Queensbridge/Ravenswood"
636.55 138 138 "Queens: LaGuardia Airport" "Queens: LaGuardia Airport"
606.36 4 64 "Manhattan: Alphabet City" "Queens: Douglaston"

The Unknown: NA locations are not very useful, let's get rid of them. I'll filter the data so they're excluded.

Location ID of 265 means Unknown: NA, so we need to get rid of any row that has 265 in the PULocationID or DOLocationID columns.

clean_data = (
    data
    .filter( 
        (pl.col("PULocationID") != 265) &
        (pl.col("DOLocationID") != 265)
    )
)
clean_data.head(5)
shape: (5, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
1 2022-10-01 00:03:41 2022-10-01 00:18:39 1.0 1.7 1.0 "N" 249 107 1 9.5 3.0 0.5 2.65 0.0 0.3 15.95 2.5 0.0
2 2022-10-01 00:14:30 2022-10-01 00:19:48 2.0 0.72 1.0 "N" 151 238 2 5.5 0.5 0.5 0.0 0.0 0.3 9.3 2.5 0.0
2 2022-10-01 00:27:13 2022-10-01 00:37:41 1.0 1.74 1.0 "N" 238 166 1 9.0 0.5 0.5 2.06 0.0 0.3 12.36 0.0 0.0
1 2022-10-01 00:32:53 2022-10-01 00:38:55 0.0 1.3 1.0 "N" 142 239 1 6.5 3.0 0.5 2.05 0.0 0.3 12.35 2.5 0.0
1 2022-10-01 00:44:55 2022-10-01 00:50:21 0.0 1.0 1.0 "N" 238 166 1 6.0 0.5 0.5 1.8 0.0 0.3 9.1 0.0 0.0

Ok, now we can look at the most expensive rides again, with the Unknown locations removed:

most_expensive = (
    clean_data
    .sort("total_amount", reverse=True)
    .head(10)
    .select(["total_amount", "PULocationID", "DOLocationID"])
)

most_expensive
shape: (10, 3)
total_amount PULocationID DOLocationID
f64 i64 i64
950.3 226 226
950.3 260 260
800.3 140 229
639.8 68 193
636.55 138 138
606.36 4 64
590.05 238 1
550.3 100 48
505.3 76 76
502.8 79 132

Let's add the location names back in. I'll create a function to do this so we can use it whenever we need.

def add_location_names(data):
    return (
        data
        .join(zone_data, how="left", left_on="PULocationID", right_on="LocationID").rename({"Location": "PickUpLocation"})
        .join(zone_data, how="left", left_on="DOLocationID", right_on="LocationID").rename({"Location": "DropOffLocation"})
    )
add_location_names(most_expensive)
shape: (10, 5)
total_amount PULocationID DOLocationID PickUpLocation DropOffLocation
f64 i64 i64 str str
950.3 226 226 "Queens: Sunnyside" "Queens: Sunnyside"
950.3 260 260 "Queens: Woodside" "Queens: Woodside"
800.3 140 229 "Manhattan: Lenox Hill East" "Manhattan: Sutton Place/Turtle Bay North"
639.8 68 193 "Manhattan: East Chelsea" "Queens: Queensbridge/Ravenswood"
636.55 138 138 "Queens: LaGuardia Airport" "Queens: LaGuardia Airport"
606.36 4 64 "Manhattan: Alphabet City" "Queens: Douglaston"
590.05 238 1 "Manhattan: Upper West Side North" "EWR: Newark Airport"
550.3 100 48 "Manhattan: Garment District" "Manhattan: Clinton East"
505.3 76 76 "Brooklyn: East New York" "Brooklyn: East New York"
502.8 79 132 "Manhattan: East Village" "Queens: JFK Airport"

Hmm. Somebody paid \$950.3 to go from Queens: Sunnyside to Queens: Sunnyside. Same for Woodside. There was also an \$800 ride from Lenox Hill East to Sutton Place, which according to Google is a 4 minute car ride.

I'm starting to think our data might have junk in it. Exciting - finding junk in your data means you're doing Data Science™. Now we just have to throw in some Excel and we'll have Machine Learning™.

There's a lot we could do from here. For example, we could calculate the average fare by pick up and drop off location. Or we could look at the impact of pick up time on fares (there's a tpep_pickup_datetime column in the data).

For some reason I decided it'd be interesting to see which pickup-dropoff location pairs have the highest variation in fare. We can figure this out by grouping the data by ("PULocationID", "DOLocationID"), then calculating the standard deviation for each grouping, and looking at the locations with the highest deviation.

amount_stddevs = (
    clean_data
    .groupby(["PULocationID", "DOLocationID"])
    .agg(
        [
            pl.col("total_amount").std().alias("total_amount_stddev")
        ]
    )
    .sort("total_amount_stddev", reverse=True)
    .head(20)
)

amount_stddevs
shape: (20, 3)
PULocationID DOLocationID total_amount_stddev
i64 i64 f64
16 175 219.899498
192 129 190.126871
260 223 155.620968
93 192 137.885822
98 98 137.461869
194 129 137.14336
55 132 130.249069
50 182 129.223764
260 129 122.231773
206 206 120.172797
68 193 119.954706
250 250 119.445285
19 19 118.451447
85 28 115.611959
137 201 114.268456
223 174 111.510739
92 252 110.662211
132 184 107.047076
164 187 104.368961
209 117 103.060813

Let's add the location names:

add_location_names(amount_stddevs)
shape: (20, 5)
PULocationID DOLocationID total_amount_stddev PickUpLocation DropOffLocation
i64 i64 f64 str str
16 175 219.899498 "Queens: Bayside" "Queens: Oakland Gardens"
192 129 190.126871 "Queens: Queensboro Hill" "Queens: Jackson Heights"
260 223 155.620968 "Queens: Woodside" "Queens: Steinway"
93 192 137.885822 "Queens: Flushing Meadows-Corona Park" "Queens: Queensboro Hill"
98 98 137.461869 "Queens: Fresh Meadows" "Queens: Fresh Meadows"
194 129 137.14336 "Manhattan: Randalls Island" "Queens: Jackson Heights"
55 132 130.249069 "Brooklyn: Coney Island" "Queens: JFK Airport"
50 182 129.223764 "Manhattan: Clinton West" "Bronx: Parkchester"
260 129 122.231773 "Queens: Woodside" "Queens: Jackson Heights"
206 206 120.172797 "Staten Island: Saint George/New Brighton" "Staten Island: Saint George/New Brighton"
68 193 119.954706 "Manhattan: East Chelsea" "Queens: Queensbridge/Ravenswood"
250 250 119.445285 "Bronx: Westchester Village/Unionport" "Bronx: Westchester Village/Unionport"
19 19 118.451447 "Queens: Bellerose" "Queens: Bellerose"
85 28 115.611959 "Brooklyn: Erasmus" "Queens: Briarwood/Jamaica Hills"
137 201 114.268456 "Manhattan: Kips Bay" "Queens: Rockaway Park"
223 174 111.510739 "Queens: Steinway" "Bronx: Norwood"
92 252 110.662211 "Queens: Flushing" "Queens: Whitestone"
132 184 107.047076 "Queens: JFK Airport" "Bronx: Pelham Bay Park"
164 187 104.368961 "Manhattan: Midtown South" "Staten Island: Port Richmond"
209 117 103.060813 "Manhattan: Seaport" "Queens: Hammels/Arverne"

That's... less interesting than I thought it would be. I suppose the absolute values are pretty interesting - our top hit, Queens: Bayside to Queens: Oakland Gardens, is about a 5 minute car ride and has a standard deviation of \$220. That seems suspect.

This much standard deviation means we have bad data.
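
Part of the problem may also be sample size - a standard deviation over two or three rides doesn't tell us much. A quick way to check (a sketch) is to add a ride count to the aggregation:

(
    clean_data
    .groupby(["PULocationID", "DOLocationID"])
    .agg(
        [
            pl.col("total_amount").std().alias("total_amount_stddev"),
            pl.col("total_amount").count().alias("n_rides"),  # how many rides are behind each stddev
        ]
    )
    .sort("total_amount_stddev", reverse=True)
    .head(20)
)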

Data Exploration and Cleansing

A lot of real-world data work involves cleaning data and finding outliers, so let's try that with this data.

There are a number of techniques for finding outliers, but first let's take a better look at our data. Polars has a built-in describe function that calculates basic statistics for your dataframe:

data.describe()
shape: (7, 20)
describe VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
str f64 str str f64 f64 f64 str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"count" 3.675411e6 "3675411" "3675411" 3.675411e6 3.675411e6 3.675411e6 "3675411" 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6 3.675411e6
"null_count" 0.0 "0" "0" 133021.0 0.0 133021.0 "133021" 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 133021.0 133021.0
"mean" 1.720443 null null 1.384694 6.206977 1.424337 null 165.498365 162.891786 1.188275 15.299953 0.987793 0.487229 2.859254 0.583306 0.295531 22.247541 2.28062 0.103128
"std" 0.480032 null null 0.930231 640.823768 5.685591 null 65.213026 70.197474 0.540651 14.674444 1.242828 0.101155 3.374175 2.111967 0.051038 18.368245 0.75784 0.347974
"min" 1.0 "2008-12-31 23:02:01.000000000" "2009-01-01 02:20:59.000000000" 0.0 0.0 1.0 "N" 1.0 1.0 0.0 -1294.5 -22.18 -0.5 -100.01 -70.0 -0.3 -1301.85 -2.5 -1.25
"max" 6.0 "2022-11-01 01:27:35.000000000" "2022-11-03 17:26:46.000000000" 9.0 389678.46 99.0 "Y" 265.0 265.0 4.0 1294.5 10.8 25.48 500.0 516.75 1.0 1301.85 2.5 1.25
"median" 2.0 null null 1.0 1.9 1.0 null 162.0 162.0 1.0 10.5 0.5 0.5 2.22 0.0 0.3 16.55 2.5 0.0

That's pretty helpful - for example, the minimum pick up date is 2008-12-31, which means this data goes all the way back to the end of 2008. Because of the way the TLC data download page is organized I thought I only had data from October 2022.

The largest tip amount is \$500 (max of the tip_amount column). The minimum tip is \$-100.01. Must've been a pretty rough ride.

Looking at the total_amount column we see that the maximum ride amount was \$1301.85 and the minimum was -\$1301.85. I don't know much about this data, but I'm going to guess that's an incorrectly charged ride and a refund. That's unfortunate - having large invalid data (\$1301.85 and -\$1301.85) at either extreme of values will wreak havoc on our standard deviations and such.

The longest ride (max of the trip_distance field) is 389678.46. Is that in miles? Let's look at the data dictionary provided by the TLC.

(Screenshot of the yellow taxi data dictionary from the TLC website.)

It is in fact in miles. The continental United States is about 3000 miles across, so this enterprising taxi travelled the entire US about 130 times.

The data dictionary tells us a lot of other useful things - for example, the Payment_type column should help us find disputed, voided, and other invalid charges.
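
For example, according to the data dictionary, payment_type 1 is a credit card, 2 is cash, 3 is no charge, 4 is a dispute, 5 is unknown, and 6 is a voided trip. A quick sketch of how our rides break down by payment type:

# Counts per payment_type code - note the data also contains a 0 code
# that the data dictionary doesn't mention.
(
    data
    .groupby("payment_type")
    .agg([pl.count().alias("n_rides")])
    .sort("n_rides", reverse=True)
)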

Basic Cleaning

So what should we do? I'm thinking we look at more recent data, say anything from 2020 onwards. Let's get rid of anything with a trip distance over 3000 miles, and let's also get rid of rows with unreasonable negative numbers - e.g. negative total amounts, tolls, etc. We can also get rid of rows where the Payment_type is No Charge, Dispute, Voided, etc.

The trouble is that if we just get rid of the negative amounts and disputes, we're leaving the corresponding, equally invalid, positive amounts in the data - e.g. if we drop the row that has the -\$1301.85 amount we'll still have the \$1301.85 row in our data.

We'll have to figure out how to drop both: for each of the negative numbers, we'll need to find the corresponding positive amount and get rid of those rows too.

Let's take a look at the outlier data and see if we can spot any clues for how to get rid of them.

(
    data
    .filter(
        (pl.col("total_amount") > 1300) |
        (pl.col("total_amount") < -1300)
    )
)
shape: (2, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
2 2022-10-12 07:00:43 2022-10-12 12:31:57 4.0 259.33 4.0 "N" 39 265 2 -1294.5 0.0 -0.5 0.0 -6.55 -0.3 -1301.85 0.0 0.0
2 2022-10-12 07:00:43 2022-10-12 12:31:57 4.0 259.33 4.0 "N" 39 265 2 1294.5 0.0 0.5 0.0 6.55 0.3 1301.85 0.0 0.0

The payment type is 2, which is cash. Somebody paid $1300 in cash, then got it handed back to him. I was hoping the payment type would be Voided or something reasonable.

Getting rid of these invalid pairs is going to be a bit painful - we'll have to find each invalid ride and also find its pair.

Before we do that, let's see if the problem still exists in the data we care about, which is data from 2020 onwards.

data.filter( pl.col("tpep_pickup_datetime") > "2020" ).describe()
Warning: Comparing date/datetime/time column to string value, this will lead to string comparison and is unlikely what you want.
If this is intended, consider using an explicit cast to silence this warning.
shape: (7, 20)
describe VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
str f64 str str f64 f64 f64 str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"count" 3.675407e6 "3675407" "3675407" 3.675407e6 3.675407e6 3.675407e6 "3675407" 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6
"null_count" 0.0 "0" "0" 133021.0 0.0 133021.0 "133021" 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 133021.0 133021.0
"mean" 1.720442 null null 1.384694 6.206973 1.424337 null 165.498365 162.891771 1.188275 15.299922 0.987794 0.487229 2.859248 0.583304 0.295531 22.247504 2.280621 0.103127
"std" 0.480032 null null 0.930231 640.824117 5.685594 null 65.213049 70.197445 0.540651 14.674414 1.242828 0.101155 3.374169 2.111966 0.051038 18.368208 0.757839 0.347973
"min" 1.0 "2022-09-30 14:06:36.000000000" "2022-09-30 14:23:04.000000000" 0.0 0.0 1.0 "N" 1.0 1.0 0.0 -1294.5 -22.18 -0.5 -100.01 -70.0 -0.3 -1301.85 -2.5 -1.25
"max" 6.0 "2022-11-01 01:27:35.000000000" "2022-11-03 17:26:46.000000000" 9.0 389678.46 99.0 "Y" 265.0 265.0 4.0 1294.5 10.8 25.48 500.0 516.75 1.0 1301.85 2.5 1.25
"median" 2.0 null null 1.0 1.9 1.0 null 162.0 162.0 1.0 10.5 0.5 0.5 2.22 0.0 0.3 16.55 2.5 0.0

That didn't work - our min tpep_pickup_datetime is 2022-09-30 (I was expecting 2020). I guess I can't filter a datetime column (tpep_pickup_datetime) using a string ("2020"). Makes sense, let's fix it.

from datetime import datetime
start_date = datetime(2020,1,1,0,0,0)
print(f"Filtering data to start at {start_date}")
data.filter( pl.col("tpep_pickup_datetime") > start_date ).describe()
Filtering data to start at 2020-01-01 00:00:00
shape: (7, 20)
describe VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
str f64 str str f64 f64 f64 str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"count" 3.675407e6 "3675407" "3675407" 3.675407e6 3.675407e6 3.675407e6 "3675407" 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6 3.675407e6
"null_count" 0.0 "0" "0" 133021.0 0.0 133021.0 "133021" 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 133021.0 133021.0
"mean" 1.720442 null null 1.384694 6.206973 1.424337 null 165.498365 162.891771 1.188275 15.299922 0.987794 0.487229 2.859248 0.583304 0.295531 22.247504 2.280621 0.103127
"std" 0.480032 null null 0.930231 640.824117 5.685594 null 65.213049 70.197445 0.540651 14.674414 1.242828 0.101155 3.374169 2.111966 0.051038 18.368208 0.757839 0.347973
"min" 1.0 "2022-09-30 14:06:36.000000000" "2022-09-30 14:23:04.000000000" 0.0 0.0 1.0 "N" 1.0 1.0 0.0 -1294.5 -22.18 -0.5 -100.01 -70.0 -0.3 -1301.85 -2.5 -1.25
"max" 6.0 "2022-11-01 01:27:35.000000000" "2022-11-03 17:26:46.000000000" 9.0 389678.46 99.0 "Y" 265.0 265.0 4.0 1294.5 10.8 25.48 500.0 516.75 1.0 1301.85 2.5 1.25
"median" 2.0 null null 1.0 1.9 1.0 null 162.0 162.0 1.0 10.5 0.5 0.5 2.22 0.0 0.3 16.55 2.5 0.0

My min tpep_pickup_datetime is still 2022-09-30 (I was expecting 2020). What's going on? Maybe our data is only for October 2022 after all, and the 2008 date I was seeing was spurious? Note that we had 3675411 data points before we filtered by date and now we have 3675407 data points (I'm getting this by looking at the count field for tpep_pickup_datetime). Our date filtering only got rid of 4 data points. So we're looking only at October 2022 data with a couple of spurious data points thrown in.

I'm curious what the data with bad dates look like.

(
    data
    .filter(
        (pl.col("tpep_pickup_datetime") < start_date)
    )
)
shape: (4, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
2 2009-01-01 01:57:50 2009-01-01 02:20:59 1.0 7.84 1.0 "N" 161 264 2 22.0 0.5 0.5 0.0 0.0 0.3 23.3 0.0 0.0
2 2008-12-31 23:02:01 2009-01-01 19:08:45 1.0 22.93 1.0 "N" 132 25 1 63.0 0.0 0.5 13.01 0.0 0.3 78.06 0.0 1.25
2 2008-12-31 23:02:02 2009-01-01 20:14:39 1.0 0.0 2.0 "N" 231 231 1 52.0 0.0 0.5 10.0 0.0 0.3 66.55 2.5 1.25
2 2009-01-01 00:02:31 2009-01-01 17:02:49 1.0 8.11 1.0 "N" 138 186 1 35.5 1.0 0.5 9.52 6.55 0.3 57.12 2.5 1.25

Finding bad pairs

I don't see any obvious ways to correlate the bad pairs - the rows don't have IDs, there is no cab ID, and the payment method field doesn't give us much either. We'll have to data science.

Let's try this:

- Find negative data points
- For each negative data point, find a data point with the same amount but positive that also matches other parameters such as date and pick up / drop off location

(
    data
    .filter( pl.col("fare_amount") < 0)
)
shape: (25899, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
2 2022-10-01 00:52:07 2022-10-01 01:02:44 1.0 2.83 1.0 "N" 141 79 4 -10.5 -0.5 -0.5 0.0 0.0 -0.3 -14.3 -2.5 0.0
2 2022-10-01 00:29:57 2022-10-01 00:34:37 1.0 0.31 1.0 "N" 264 264 4 -4.5 -0.5 -0.5 0.0 0.0 -0.3 -8.3 -2.5 0.0
2 2022-10-01 00:31:37 2022-10-01 00:32:53 1.0 0.31 1.0 "N" 138 138 4 -3.0 -0.5 -0.5 0.0 0.0 -0.3 -5.55 0.0 -1.25
2 2022-10-01 00:46:38 2022-10-01 00:46:44 1.0 0.0 2.0 "N" 249 249 3 -52.0 0.0 -0.5 0.0 0.0 -0.3 -55.3 -2.5 0.0
2 2022-10-01 00:06:08 2022-10-01 00:48:23 1.0 17.05 1.0 "N" 238 175 4 -50.5 -0.5 -0.5 0.0 -6.55 -0.3 -60.85 -2.5 0.0
2 2022-10-01 00:40:58 2022-10-01 00:56:33 1.0 2.7 5.0 "N" 148 90 4 -27.31 0.0 -0.5 0.0 0.0 -0.3 -30.61 -2.5 0.0
2 2022-10-01 00:43:19 2022-10-01 00:44:02 1.0 0.02 1.0 "N" 90 90 3 -2.5 -0.5 -0.5 0.0 0.0 -0.3 -6.3 -2.5 0.0
2 2022-10-01 00:12:32 2022-10-01 00:12:51 1.0 0.0 1.0 "N" 48 48 4 -2.5 -0.5 -0.5 0.0 0.0 -0.3 -6.3 -2.5 0.0
2 2022-10-01 00:12:24 2022-10-01 00:13:06 1.0 0.33 1.0 "N" 138 138 4 -3.0 -0.5 -0.5 0.0 0.0 -0.3 -5.55 0.0 -1.25
2 2022-10-01 00:39:21 2022-10-01 00:40:07 4.0 0.15 1.0 "N" 230 230 3 -2.5 -0.5 -0.5 0.0 0.0 -0.3 -6.3 -2.5 0.0
2 2022-10-01 00:08:25 2022-10-01 00:21:22 2.0 1.46 1.0 "N" 79 211 4 -9.5 -0.5 -0.5 0.0 0.0 -0.3 -13.3 -2.5 0.0
2 2022-10-01 00:45:01 2022-10-01 01:18:01 2.0 2.35 1.0 "N" 79 249 4 -13.0 -0.5 -0.5 0.0 0.0 -0.3 -16.8 -2.5 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2 2022-10-06 16:17:49 2022-10-06 16:20:42 null 0.28 null null 79 107 0 -95.68 0.0 0.5 15.22 0.0 0.3 -77.16 null null
2 2022-10-10 12:27:10 2022-10-10 12:56:00 null 60043.58 null null 125 68 0 -17.28 0.0 -0.5 -4.49 0.0 0.3 -24.77 null null
2 2022-10-13 21:55:30 2022-10-13 21:56:26 null 0.03 null null 144 144 0 -27.33 0.0 0.5 3.69 0.0 0.3 -20.34 null null
2 2022-10-13 22:44:00 2022-10-13 22:45:00 null 0.03 null null 90 90 0 -38.85 0.0 0.5 6.62 0.0 0.3 -28.93 null null
2 2022-10-14 03:26:00 2022-10-14 03:27:00 null 0.04 null null 114 114 0 -26.79 0.0 0.5 0.0 0.0 0.3 -23.49 null null
2 2022-10-16 04:52:00 2022-10-16 05:01:00 null 2.71 null null 233 262 0 -61.61 0.0 0.5 5.9 0.0 0.3 -52.41 null null
1 2022-10-18 13:06:39 2022-10-18 13:06:52 null 0.0 null null 232 232 0 -0.8 0.0 0.5 0.0 0.0 0.3 2.5 null null
2 2022-10-19 01:55:00 2022-10-19 01:56:00 null 0.06 null null 237 237 0 -22.39 0.0 0.5 2.92 0.0 0.3 -16.17 null null
1 2022-10-22 19:35:59 2022-10-22 19:36:03 null 0.0 null null 239 239 0 -0.8 0.0 0.5 0.0 0.0 0.3 2.5 null null
2 2022-10-25 07:40:00 2022-10-25 07:42:00 null 0.07 null null 140 140 0 -54.25 0.0 0.5 7.29 0.0 0.3 -43.66 null null
1 2022-10-28 09:13:07 2022-10-28 09:13:26 null 0.0 null null 141 141 0 -0.8 0.0 0.5 0.0 0.0 0.3 2.5 null null
2 2022-10-29 07:33:24 2022-10-29 07:37:01 null 0.5 null null 37 37 0 -18.55 0.0 0.5 2.72 0.0 0.3 -15.03 null null

There's probably a clever way to do this, but my initial thought for getting rid of the corresponding positives is to create a new dataframe with the negative rows, then do an anti-join back to the original dataframe. An anti-join keeps only the rows in the left dataframe that don't have a match in the right dataframe - in other words, it finds the matching rows and drops them.
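
If you haven't run into anti-joins before, here's a toy sketch of the behaviour - the result keeps only the rows from the left dataframe that have no match in the right one:

left = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
right = pl.DataFrame({"id": [2]})

left.join(right, on="id", how="anti")  # keeps the rows with id 1 and 3, drops id 2

Back to the taxi data - first, grab the negative rows: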

negative_data = (
    data
    .filter( pl.col("fare_amount") < 0)
).select(["tpep_pickup_datetime", "PULocationID", "fare_amount"])

negative_data
shape: (25899, 3)
tpep_pickup_datetime PULocationID fare_amount
datetime[ns] i64 f64
2022-10-01 00:52:07 141 -10.5
2022-10-01 00:29:57 264 -4.5
2022-10-01 00:31:37 138 -3.0
2022-10-01 00:46:38 249 -52.0
2022-10-01 00:06:08 238 -50.5
2022-10-01 00:40:58 148 -27.31
2022-10-01 00:43:19 90 -2.5
2022-10-01 00:12:32 48 -2.5
2022-10-01 00:12:24 138 -3.0
2022-10-01 00:39:21 230 -2.5
2022-10-01 00:08:25 79 -9.5
2022-10-01 00:45:01 79 -13.0
... ... ...
2022-10-06 16:17:49 79 -95.68
2022-10-10 12:27:10 125 -17.28
2022-10-13 21:55:30 144 -27.33
2022-10-13 22:44:00 90 -38.85
2022-10-14 03:26:00 114 -26.79
2022-10-16 04:52:00 233 -61.61
2022-10-18 13:06:39 232 -0.8
2022-10-19 01:55:00 237 -22.39
2022-10-22 19:35:59 239 -0.8
2022-10-25 07:40:00 140 -54.25
2022-10-28 09:13:07 141 -0.8
2022-10-29 07:33:24 37 -18.55

I'm only grabbing a few columns for my negative_data dataframe, since all I'm going to use it for is to remove the corresponding positive rows from the original data. I'll try matching on the pick up date and pick up location.

Let's convert the negative amounts to positive so we can directly match them to their corresponding positive rows in the original data.

negative_data = negative_data.with_columns([
    (pl.col("fare_amount") * -1).alias("amount_positive")
])

negative_data
shape: (25899, 4)
tpep_pickup_datetime PULocationID fare_amount amount_positive
datetime[ns] i64 f64 f64
2022-10-01 00:52:07 141 -10.5 10.5
2022-10-01 00:29:57 264 -4.5 4.5
2022-10-01 00:31:37 138 -3.0 3.0
2022-10-01 00:46:38 249 -52.0 52.0
2022-10-01 00:06:08 238 -50.5 50.5
2022-10-01 00:40:58 148 -27.31 27.31
2022-10-01 00:43:19 90 -2.5 2.5
2022-10-01 00:12:32 48 -2.5 2.5
2022-10-01 00:12:24 138 -3.0 3.0
2022-10-01 00:39:21 230 -2.5 2.5
2022-10-01 00:08:25 79 -9.5 9.5
2022-10-01 00:45:01 79 -13.0 13.0
... ... ... ...
2022-10-06 16:17:49 79 -95.68 95.68
2022-10-10 12:27:10 125 -17.28 17.28
2022-10-13 21:55:30 144 -27.33 27.33
2022-10-13 22:44:00 90 -38.85 38.85
2022-10-14 03:26:00 114 -26.79 26.79
2022-10-16 04:52:00 233 -61.61 61.61
2022-10-18 13:06:39 232 -0.8 0.8
2022-10-19 01:55:00 237 -22.39 22.39
2022-10-22 19:35:59 239 -0.8 0.8
2022-10-25 07:40:00 140 -54.25 54.25
2022-10-28 09:13:07 141 -0.8 0.8
2022-10-29 07:33:24 37 -18.55 18.55

Now that we have a dataframe with the negative amounts converted to positive, we can join back to the original data, matching on the (now positive) fare amount and the pick up location. We'll do an anti-join, which gets rid of the rows that match. If Polars didn't have anti-join as an option we could have merged the new dataframe with the original data and dropped duplicate rows.

data.join(negative_data, on=("PULocationID", "fare_amount"), how="anti")
shape: (3649512, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
1 2022-10-01 00:03:41 2022-10-01 00:18:39 1.0 1.7 1.0 "N" 249 107 1 9.5 3.0 0.5 2.65 0.0 0.3 15.95 2.5 0.0
2 2022-10-01 00:14:30 2022-10-01 00:19:48 2.0 0.72 1.0 "N" 151 238 2 5.5 0.5 0.5 0.0 0.0 0.3 9.3 2.5 0.0
2 2022-10-01 00:27:13 2022-10-01 00:37:41 1.0 1.74 1.0 "N" 238 166 1 9.0 0.5 0.5 2.06 0.0 0.3 12.36 0.0 0.0
1 2022-10-01 00:32:53 2022-10-01 00:38:55 0.0 1.3 1.0 "N" 142 239 1 6.5 3.0 0.5 2.05 0.0 0.3 12.35 2.5 0.0
1 2022-10-01 00:44:55 2022-10-01 00:50:21 0.0 1.0 1.0 "N" 238 166 1 6.0 0.5 0.5 1.8 0.0 0.3 9.1 0.0 0.0
1 2022-10-01 00:22:52 2022-10-01 00:52:14 1.0 6.8 1.0 "Y" 186 41 2 25.5 3.0 0.5 0.0 0.0 0.3 29.3 2.5 0.0
2 2022-10-01 00:33:19 2022-10-01 00:44:51 3.0 1.88 1.0 "N" 162 145 2 10.5 0.5 0.5 0.0 0.0 0.3 14.3 2.5 0.0
1 2022-10-01 00:02:42 2022-10-01 00:50:01 1.0 12.2 1.0 "N" 100 22 1 41.0 3.0 0.5 3.0 0.0 0.3 47.8 2.5 0.0
2 2022-10-01 00:06:35 2022-10-01 00:24:38 1.0 7.79 1.0 "N" 138 112 1 23.5 0.5 0.5 4.96 0.0 0.3 31.01 0.0 1.25
2 2022-10-01 00:29:25 2022-10-01 00:43:15 1.0 4.72 1.0 "N" 145 75 1 14.5 0.5 0.5 1.5 0.0 0.3 19.8 2.5 0.0
1 2022-10-01 00:01:55 2022-10-01 00:20:16 1.0 8.8 1.0 "N" 138 236 1 26.0 4.25 0.5 5.64 6.55 0.3 43.24 2.5 1.25
1 2022-10-01 00:27:48 2022-10-01 00:59:50 1.0 8.6 1.0 "N" 140 36 1 29.5 3.0 0.5 6.0 0.0 0.3 39.3 2.5 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2 2022-10-31 23:30:42 2022-10-31 23:40:16 null 1.7 null null 158 90 0 9.95 0.0 0.5 2.93 0.0 0.3 16.18 null null
1 2022-10-31 23:42:05 2022-11-01 00:13:58 null 17.5 null null 132 68 0 52.0 1.25 0.5 9.46 6.55 0.3 72.56 null null
2 2022-10-31 23:52:34 2022-11-01 00:23:32 null 12.6 null null 116 37 0 40.67 0.0 0.5 5.26 6.55 0.3 53.28 null null
2 2022-10-31 23:16:12 2022-10-31 23:32:36 null 6.23 null null 158 166 0 22.5 0.0 0.5 5.67 0.0 0.3 31.47 null null
2 2022-10-31 23:15:00 2022-10-31 23:20:00 null 0.72 null null 142 142 0 9.95 0.0 0.5 2.92 0.0 0.3 16.17 null null
1 2022-10-31 23:20:21 2022-10-31 23:34:15 null 2.7 null null 163 68 0 12.0 0.5 0.5 3.95 0.0 0.3 19.75 null null
2 2022-10-31 23:45:37 2022-11-01 00:00:39 null 2.45 null null 249 162 0 11.97 0.0 0.5 3.39 0.0 0.3 18.66 null null
2 2022-10-31 23:56:35 2022-11-01 00:10:11 null 2.25 null null 137 50 0 12.68 0.0 0.5 3.16 0.0 0.3 19.14 null null
2 2022-10-31 23:22:00 2022-10-31 23:28:00 null 1.22 null null 142 161 0 10.0 0.0 0.5 1.0 0.0 0.3 14.3 null null
2 2022-10-31 23:25:00 2022-10-31 23:48:00 null 4.7 null null 186 45 0 19.07 0.0 0.5 4.97 0.0 0.3 27.34 null null
2 2022-10-31 23:32:54 2022-10-31 23:33:02 null 0.0 null null 264 263 0 16.82 0.0 0.5 4.58 0.0 0.3 22.2 null null
1 2022-10-31 23:21:34 2022-10-31 23:32:42 null 0.0 null null 166 116 0 8.93 0.0 0.5 0.0 0.0 0.3 9.73 null null

Did that work? Let's see how many rows it got rid of. After the anti-join we end up with 3649512 rows, and we had 25899 rows with negative values.

n_rows_original = data.shape[0]
n_rows_negative = negative_data.shape[0]
print(f"We are expecting {n_rows_original - n_rows_negative} rows if our removal worked perfectly")
We are expecting 3649512 rows if our removal worked perfectly

Sure enough the result of the anti-join has 3649512 rows (you can see that from the shape output a few cells above), so the row count lines up with what we expected. We should sanity check to make sure it's removing the right rows, but for now let's assume all is well.
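
One caveat worth flagging: the join above keys on fare_amount, and in negative_data that column is still negative, so what the anti-join actually matched (and dropped) are the negative rows themselves rather than their positive counterparts. To go after the positive counterparts we'd want to key on the amount_positive column we built earlier - something along these lines (a sketch, I haven't re-checked the row counts with it):

# Match the positive rows in the original data against the sign-flipped
# amounts from the negative rows, and drop them.
data.join(
    negative_data,
    left_on=["PULocationID", "fare_amount"],
    right_on=["PULocationID", "amount_positive"],
    how="anti",
)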

We can now do our final cleaning: we'll get rid of the negative and corresponding positive rows, get rid of anything where the mileage is over 3000 miles, remove dates that are out of range, and get rid of rows with unknown pick up and drop off locations:

clean_data = (
    data
    .join(negative_data, on=("PULocationID", "fare_amount"), how="anti")
    .filter( pl.col("fare_amount") >= 0)
    .filter( pl.col("trip_distance") <= 3000 )
    .filter( pl.col("tpep_pickup_datetime") >= start_date )
    .filter( 
        (pl.col("PULocationID") != 265) &
        (pl.col("DOLocationID") != 265)
    )
)

clean_data
shape: (3630123, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
1 2022-10-01 00:03:41 2022-10-01 00:18:39 1.0 1.7 1.0 "N" 249 107 1 9.5 3.0 0.5 2.65 0.0 0.3 15.95 2.5 0.0
2 2022-10-01 00:14:30 2022-10-01 00:19:48 2.0 0.72 1.0 "N" 151 238 2 5.5 0.5 0.5 0.0 0.0 0.3 9.3 2.5 0.0
2 2022-10-01 00:27:13 2022-10-01 00:37:41 1.0 1.74 1.0 "N" 238 166 1 9.0 0.5 0.5 2.06 0.0 0.3 12.36 0.0 0.0
1 2022-10-01 00:32:53 2022-10-01 00:38:55 0.0 1.3 1.0 "N" 142 239 1 6.5 3.0 0.5 2.05 0.0 0.3 12.35 2.5 0.0
1 2022-10-01 00:44:55 2022-10-01 00:50:21 0.0 1.0 1.0 "N" 238 166 1 6.0 0.5 0.5 1.8 0.0 0.3 9.1 0.0 0.0
1 2022-10-01 00:22:52 2022-10-01 00:52:14 1.0 6.8 1.0 "Y" 186 41 2 25.5 3.0 0.5 0.0 0.0 0.3 29.3 2.5 0.0
2 2022-10-01 00:33:19 2022-10-01 00:44:51 3.0 1.88 1.0 "N" 162 145 2 10.5 0.5 0.5 0.0 0.0 0.3 14.3 2.5 0.0
1 2022-10-01 00:02:42 2022-10-01 00:50:01 1.0 12.2 1.0 "N" 100 22 1 41.0 3.0 0.5 3.0 0.0 0.3 47.8 2.5 0.0
2 2022-10-01 00:06:35 2022-10-01 00:24:38 1.0 7.79 1.0 "N" 138 112 1 23.5 0.5 0.5 4.96 0.0 0.3 31.01 0.0 1.25
2 2022-10-01 00:29:25 2022-10-01 00:43:15 1.0 4.72 1.0 "N" 145 75 1 14.5 0.5 0.5 1.5 0.0 0.3 19.8 2.5 0.0
1 2022-10-01 00:01:55 2022-10-01 00:20:16 1.0 8.8 1.0 "N" 138 236 1 26.0 4.25 0.5 5.64 6.55 0.3 43.24 2.5 1.25
1 2022-10-01 00:27:48 2022-10-01 00:59:50 1.0 8.6 1.0 "N" 140 36 1 29.5 3.0 0.5 6.0 0.0 0.3 39.3 2.5 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2 2022-10-31 23:30:42 2022-10-31 23:40:16 null 1.7 null null 158 90 0 9.95 0.0 0.5 2.93 0.0 0.3 16.18 null null
1 2022-10-31 23:42:05 2022-11-01 00:13:58 null 17.5 null null 132 68 0 52.0 1.25 0.5 9.46 6.55 0.3 72.56 null null
2 2022-10-31 23:52:34 2022-11-01 00:23:32 null 12.6 null null 116 37 0 40.67 0.0 0.5 5.26 6.55 0.3 53.28 null null
2 2022-10-31 23:16:12 2022-10-31 23:32:36 null 6.23 null null 158 166 0 22.5 0.0 0.5 5.67 0.0 0.3 31.47 null null
2 2022-10-31 23:15:00 2022-10-31 23:20:00 null 0.72 null null 142 142 0 9.95 0.0 0.5 2.92 0.0 0.3 16.17 null null
1 2022-10-31 23:20:21 2022-10-31 23:34:15 null 2.7 null null 163 68 0 12.0 0.5 0.5 3.95 0.0 0.3 19.75 null null
2 2022-10-31 23:45:37 2022-11-01 00:00:39 null 2.45 null null 249 162 0 11.97 0.0 0.5 3.39 0.0 0.3 18.66 null null
2 2022-10-31 23:56:35 2022-11-01 00:10:11 null 2.25 null null 137 50 0 12.68 0.0 0.5 3.16 0.0 0.3 19.14 null null
2 2022-10-31 23:22:00 2022-10-31 23:28:00 null 1.22 null null 142 161 0 10.0 0.0 0.5 1.0 0.0 0.3 14.3 null null
2 2022-10-31 23:25:00 2022-10-31 23:48:00 null 4.7 null null 186 45 0 19.07 0.0 0.5 4.97 0.0 0.3 27.34 null null
2 2022-10-31 23:32:54 2022-10-31 23:33:02 null 0.0 null null 264 263 0 16.82 0.0 0.5 4.58 0.0 0.3 22.2 null null
1 2022-10-31 23:21:34 2022-10-31 23:32:42 null 0.0 null null 166 116 0 8.93 0.0 0.5 0.0 0.0 0.3 9.73 null null

How many rows of data did we get rid of?

data.shape[0] - clean_data.shape[0]
45288
clean_data.describe()
shape: (7, 20)
describe VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
str f64 str str f64 f64 f64 str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"count" 3.630123e6 "3630123" "3630123" 3.630123e6 3.630123e6 3.630123e6 "3630123" 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6 3.630123e6
"null_count" 0.0 "0" "0" 126893.0 0.0 126893.0 "126893" 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 126893.0 126893.0
"mean" 1.712003 null null 1.384814 3.554337 1.413354 null 165.373444 162.575655 1.174523 15.200393 1.001434 0.494995 2.854254 0.571016 0.299767 22.190213 2.319343 0.103112
"std" 0.45283 null null 0.931426 4.456602 5.684473 null 65.14861 70.034565 0.504213 13.14293 1.242161 0.0556 3.227445 2.036301 0.009334 16.768073 0.648033 0.343979
"min" 1.0 "2022-09-30 14:06:36.000000000" "2022-09-30 14:23:04.000000000" 0.0 0.0 1.0 "N" 1.0 1.0 0.0 0.0 -22.18 -0.5 0.0 0.0 -0.3 -4.55 -2.5 -1.25
"max" 2.0 "2022-11-01 01:27:35.000000000" "2022-11-03 17:26:46.000000000" 9.0 462.86 99.0 "Y" 264.0 264.0 4.0 950.0 10.8 25.48 500.0 516.75 0.3 950.3 2.5 1.25
"median" 2.0 null null 1.0 1.9 1.0 null 162.0 162.0 1.0 10.5 0.5 0.5 2.25 0.0 0.3 16.55 2.5 0.0

That's looking better - we still have some negatives and other strange looking things, but you get a sense of how we'd approach getting rid of those. For next steps I'd probably filter more things that look incorrect, then I'd plot the data as a histogram and eyeball it to see if there's anything strange in there.
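
For that last eyeballing step, a minimal sketch with matplotlib (which isn't installed above, so you'd need a pip install matplotlib first):

import matplotlib.pyplot as plt

# Histogram of the cleaned fares, to eyeball the distribution for anything strange
plt.hist(clean_data["total_amount"].to_numpy(), bins=100)
plt.xlabel("total_amount ($)")
plt.ylabel("number of rides")
plt.show()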

Let's stop there.

This was my first foray into Polars, so it's likely there are more polarsy ways of doing things - if you know of improvements please let me know.

Overall I really like Polars. It's fast and clean and the resulting code is easy on the eyes. I'm likely to use it in place of Pandas going forward.

Links

Modern Polars is quite nice with many examples.

Towards Data Science has a good article called Polars: Pandas DataFrame but Much Faster.

This article from Code Magazine has a lot of good Polars examples.

The Polars documentation page on Coming from Pandas is very useful if you have Pandas experience.

While we're at it, read this piece from a former cab driver on the ride he'll never forget.