{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Agenda\n",
"\n",
"* List Comprehensions\n",
"* Data Cleaning with Pandas\n",
" * Empty Cells\n",
" * Wrong Format\n",
" * Wrong Data\n",
" * Removing Duplicates\n",
"* Modifing Data\n",
" * Collumns operations\n",
" * Aggregation and Grouping\n",
"* Pivoting and Reshaping\n",
"* Storing Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## **List Comprehensions**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* **What are list comprehensions?**\n",
" * List comprehensions provide a concise way to create lists in Python.\n",
" * They allow you to create a lists with conditional content using a **single line of code**\n",
" * The syntax is easy to read "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* **Why do we use list comprehensions?**\n",
" \n",
" * **Cleaner Code**: They reduce the amount of code needed to create or transform a list.\n",
" * **Enhanced Readability**: When used properly, list comprehensions can make your code more **readable** and **expressive**.\n",
" * **Flexibility**: They support **conditional logic** and **nested loops**, making them highly versatile."
]
},
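{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch, the general pattern of a list comprehension is:\n",
"\n",
"```newlist = [expression for item in iterable if condition]```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the pattern: squares of the even numbers from 0 to 9\n",
"squares_of_evens = [n ** 2 for n in range(10) if n % 2 == 0]\n",
"print(squares_of_evens)  # expected: [0, 4, 16, 36, 64]"
]
},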
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how this works with an example.\n",
"\n",
"Given a list of fruits, filter out all fruits that contain the letter 'a'\n",
"\n",
"First, let's see how we would solve this using conventional syntax"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['apple', 'banana', 'mango']\n"
]
}
],
"source": [
"fruits = [\"apple\", \"banana\", \"cherry\", \"kiwi\", \"mango\"] #given list\n",
"\n",
"newlist = [] #1. create new list\n",
"\n",
"for x in fruits: #2. iterate through list \n",
" if \"a\" in x: #3. if condition true if the letter a is in the word\n",
" newlist.append(x) #4. add the fruit to the new list \n",
"\n",
"print(newlist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a list comprehension you can do all that with only one line of code"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['apple', 'banana', 'mango']\n"
]
}
],
"source": [
"fruits = [\"apple\", \"banana\", \"cherry\", \"kiwi\", \"mango\"]\n",
"\n",
"newlist = [fruit for fruit in fruits if \"a\" in fruit] #the entire logic in one line\n",
"\n",
"print (newlist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this single line, we created a new list, we iterated through fruits and selected only the items that contain the letter 'a'\n",
"\n",
"unbelievable!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In list comprehensions, these three components work together as follows:\n",
"\n",
" * **Output:** Defines what each element in the new list will look like (often just the element itself or a transformed version).\n",
" * **For loop with an iterable element:** Iterates over each item in the original list or iterable, bringing each item into the comprehension for processing.\n",
" * **Condition and expression: (Optional)** Filters the items to include in the new list based on a condition, or applies an expression to transform each item."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['APPLE', 'BANANA', 'CHERRY', 'KIWI', 'MANGO']\n"
]
}
],
"source": [
"newlist = [x.upper() for x in fruits]\n",
"print(newlist)"
]
},
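{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch combining all three components, we can transform and filter in the same comprehension:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Output (x.upper()), loop (for x in fruits), and condition (if \"a\" in x) together\n",
"newlist = [x.upper() for x in fruits if \"a\" in x]\n",
"print(newlist)  # expected: ['APPLE', 'BANANA', 'MANGO']"
]
},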
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expression can also contain conditions, not like a filter, but as a way to manipulate the outcome:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original fruit basket: ['apple', 'banana', 'cherry', 'kiwi', 'mango']\n",
"Updated fruit basket: ['apple', 'banana vanished into thin air 🍌💨', 'cherry', 'kiwi', 'mango']\n"
]
}
],
"source": [
"fruits = [\"apple\", \"banana\", \"cherry\", \"kiwi\", \"mango\"]\n",
"\n",
"print(\"Original fruit basket:\", fruits)\n",
"newlist = [fruit if fruit != \"banana\" else \"banana vanished into thin air 🍌💨\" for fruit in fruits]\n",
"\n",
"print(\"Updated fruit basket:\", newlist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Data Cleaning with Pandas**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* **What is Data Cleaning?**\n",
" * Data Cleaning is the process of transforming messy data into a structured, consistent format.\n",
" * Using Pandas, we can handle common issues like missing values, duplicates, and incorrect data types efficiently.\n",
" * Pandas offers simple and powerful methods to streamline this process, ensuring data accuracy for analysis and modeling.\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Remove missing data**\n",
"\n",
"With the ```dropna()``` method you can remove missing data so called NaN (Not a Number) values.\n",
"* **Basic Usage:** ```dropna()``` removes rows (or columns) with any NaN values by default.\n",
"\n",
"* The method supports the following parameters\n",
" * **Axis Parameter:** Specify axis=0 to drop rows with NaN values (default) or axis=1 to drop columns instead.\n",
" * **Threshold Parameter:** Use thresh to set a minimum number of non-NaN values required to keep a row or column.\n",
" * **Subset Parameter:** Define specific columns to check for NaNs with the subset parameter, allowing for selective cleaning."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original DataFrame:\n",
" Customer Age City Purchase_Amount\n",
"0 Alice 24.0 New York 85.0\n",
"1 Bob NaN Los Angeles NaN\n",
"2 None 35.0 None 120.0\n",
"3 David NaN Chicago 92.0\n",
"4 Eve 28.0 None NaN\n",
"\n",
"Rows with any missing values removed:\n",
" Customer Age City Purchase_Amount\n",
"0 Alice 24.0 New York 85.0\n",
"\n",
"Columns with any missing values removed:\n",
" Empty DataFrame\n",
"Columns: []\n",
"Index: [0, 1, 2, 3, 4]\n",
"\n",
"Rows with at least 3 non-missing values:\n",
" Customer Age City Purchase_Amount\n",
"0 Alice 24.0 New York 85.0\n",
"3 David NaN Chicago 92.0\n",
"\n",
"Rows removed only if 'Customer' or 'City' has missing values:\n",
" Customer Age City Purchase_Amount\n",
"0 Alice 24.0 New York 85.0\n",
"1 Bob NaN Los Angeles NaN\n",
"3 David NaN Chicago 92.0\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Sample DataFrame with missing values\n",
"data = {\n",
" 'Customer': ['Alice', 'Bob', None, 'David', 'Eve'],\n",
" 'Age': [24, None, 35, None, 28],\n",
" 'City': ['New York', 'Los Angeles', None, 'Chicago', None],\n",
" 'Purchase_Amount': [85, None, 120, 92, None]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"print(\"Original DataFrame:\\n\", df)\n",
"\n",
"# 1. Basic Usage: Drop rows where any NaN values are present\n",
"df_dropped_any = df.dropna(axis=0)\n",
"print(\"\\nRows with any missing values removed:\\n\", df_dropped_any)\n",
"\n",
"# 2. Axis Parameter: Drop columns with any NaN values\n",
"df_dropped_columns = df.dropna(axis=1)\n",
"print(\"\\nColumns with any missing values removed:\\n\", df_dropped_columns)\n",
"\n",
"# 3. Threshold Parameter: Keep rows with at least 3 non-NaN values\n",
"df_thresh = df.dropna(thresh=3)\n",
"print(\"\\nRows with at least 3 non-missing values:\\n\", df_thresh)\n",
"\n",
"# 4. Subset Parameter: Only check the 'Customer' and 'City' columns for missing values\n",
"df_subset = df.dropna(subset=['Customer', 'City'])\n",
"print(\"\\nRows removed only if 'Customer' or 'City' has missing values:\\n\", df_subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notes: Data Cleaning with Pandas\n",
"\n",
"* df.fillna(): Fill missing values with a specified value.\n",
"* df.drop(): Drop rows or columns.\n",
"* df.duplicated(), df.drop_duplicates(): Find and remove duplicate rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Replace missing data**\n",
"\n",
"With the ```fillna()``` method: Replace missing data (NaN values) with specified values.\n",
"\n",
"* The method supports the following parameters\n",
" * **Value Parameter:** Define a specific value to fill NaNs with, such as ```fillna(value=0)``` or ```fillna(value=df.mean())``` for mean imputation.\n",
" * **Threshold Parameter:** Use thresh to set a minimum number of non-NaN values required to keep a row or column.\n",
" * **Subset Parameter:** Define specific columns to check for NaNs with the subset parameter, allowing for selective cleaning."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original DataFrame:\n",
" A B C\n",
"0 1.0 NaN 5.0\n",
"1 NaN 2.0 NaN\n",
"2 3.0 NaN NaN\n",
"3 NaN 4.0 8.0\n",
"4 5.0 NaN 9.0\n",
"\n",
"Filled with 0:\n",
" A B C\n",
"0 1.0 0.0 5.0\n",
"1 0.0 2.0 0.0\n",
"2 3.0 0.0 0.0\n",
"3 0.0 4.0 8.0\n",
"4 5.0 0.0 9.0\n",
"\n",
"Filled with column mean:\n",
" A B C\n",
"0 1.0 3.0 5.000000\n",
"1 3.0 2.0 7.333333\n",
"2 3.0 3.0 7.333333\n",
"3 3.0 4.0 8.000000\n",
"4 5.0 3.0 9.000000\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Sample DataFrame with missing values\n",
"data = {\n",
" 'A': [1, None, 3, None, 5],\n",
" 'B': [None, 2, None, 4, None],\n",
" 'C': [5, None, None, 8, 9]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"# 1. Basic Value Parameter: Fill NaNs with a specified value\n",
"df_filled_value = df.fillna(value=0)\n",
"\n",
"# 2. Using the Mean of Each Column: Fill NaNs with column means\n",
"df_filled_mean = df.fillna(value=df.mean())\n",
"\n",
"\n",
"print(\"Original DataFrame:\\n\", df)\n",
"print(\"\\nFilled with 0:\\n\", df_filled_value)\n",
"print(\"\\nFilled with column mean:\\n\", df_filled_mean)"
]
},
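{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides filling with a fixed value, Pandas can also propagate neighboring values. A minimal sketch using the related ```ffill()``` and ```bfill()``` methods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Forward fill: copy the last valid value downwards (leading NaNs remain)\n",
"print(\"Forward filled:\\n\", df.ffill())\n",
"\n",
"# Backward fill: copy the next valid value upwards (trailing NaNs remain)\n",
"print(\"\\nBackward filled:\\n\", df.bfill())"
]
},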
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Delete Row's and Colums**\n",
"\n",
"With the ```df.drop()``` method: you can remove specified rows or columns from a DataFrame.\n",
"* The method supports the following parameters:\n",
" * **Labels Parameter:** Specify row or column labels to remove e.g., ```labels=['Column1']```\n",
" * **Axis Parameter:** Set axis=0 to drop rows (default) or axis=1 to drop columns."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original DataFrame:\n",
" Name Age City\n",
"0 Alice 24 Zug\n",
"1 Bob 19 Züri\n",
"2 Elmo 24 St. Gallen\n",
"3 Charlie 23 Bern\n",
"4 Bob 19 Biel\n",
"\n",
"DataFrame after dropping row with index 1 (Bob):\n",
" Name Age City\n",
"0 Alice 24 Zug\n",
"2 Elmo 24 St. Gallen\n",
"3 Charlie 23 Bern\n",
"4 Bob 19 Biel\n",
"\n",
"DataFrame after dropping the 'City' column:\n",
" Name Age\n",
"0 Alice 24\n",
"1 Bob 19\n",
"2 Elmo 24\n",
"3 Charlie 23\n",
"4 Bob 19\n"
]
}
],
"source": [
"data = {\n",
" 'Name': ['Alice', 'Bob', 'Elmo', 'Charlie', 'Bob'],\n",
" 'Age': [24, 19, 24, 23, 19],\n",
" 'City': ['Zug', 'Züri', 'St. Gallen', 'Bern','Biel']\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"print(\"Original DataFrame:\\n\", df)\n",
"\n",
"# 1. Remove a specific row by label (e.g., drop the row at index 1, which is 'Bob')\n",
"df_dropped_row = df.drop(labels=[1], axis=0)\n",
"print(\"\\nDataFrame after dropping row with index 1 (Bob):\\n\", df_dropped_row)\n",
"\n",
"# 2. Remove a specific column by label (e.g., drop the 'City' column)\n",
"df_dropped_column = df.drop(labels=['City'], axis=1)\n",
"print(\"\\nDataFrame after dropping the 'City' column:\\n\", df_dropped_column)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Delete Duplicate's**\n",
"\n",
"The Pandas method to remove duplicates is ```drop_duplicates()```. \n",
"Parameters:\n",
"* Subset: Specify columns to check for duplicates, e.g., subset=['Column1', 'Column2'].\n",
"* Keep: Defines which duplicate to keep:\n",
" * first (default): Keeps the first occurrence of each duplicate.\n",
" * last: Keeps the last occurrence.\n",
" * False: Removes all duplicates.\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Age\n",
"0 Alice 24\n",
"1 Bob 19\n",
"3 Charlie 24\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Sample DataFrame with Alice duplicated\n",
"data = {\n",
" 'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],\n",
" 'Age': [24, 19, 24, 24, 19]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"# Remove duplicates based on the 'Name' column, keeping the first occurrence\n",
"df_unique = df.drop_duplicates(subset=['Name'], keep='first')\n",
"\n",
"print(df_unique)"
]
},
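{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the ```keep``` parameter, ```keep=False``` removes every row whose 'Name' occurs more than once:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# keep=False drops all occurrences of duplicated names (Alice and Bob vanish entirely)\n",
"df_no_dupes = df.drop_duplicates(subset=['Name'], keep=False)\n",
"print(df_no_dupes)  # expected: only Charlie remains"
]
},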
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parameters:\n",
"Subset: Specify columns to check for duplicates, e.g., df.drop_duplicates(subset=['Column1', 'Column2']).\n",
"Keep: Defines which duplicate to keep:\n",
"first (default): Keeps the first occurrence of each duplicate.\n",
"last: Keeps the last occurrence.\n",
"False: Removes all duplicates.\n",
"In-Place: By default, drop_duplicates() returns a new DataFrame. Set inplace=True to modify the original DataFrame directly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Group Data by Columns**\n",
"With the ```df.groupby()``` method: You can group data by one or more columns, allowing you to perform aggregate operations on each group separately.\n",
"\n",
"* **Basic Usage:** groupby() groups data based on the values of specified columns and can then apply functions like sum(), mean(), count(), etc., to each group.\n",
"\n",
"### **Aggregations**\n",
"\n",
"Aggregations in Pandas are functions that summarize or compute a single result from multiple values within a DataFrame or Series.\n",
"\n",
"* **df.sum()**: Calculates the sum of values for each column (or row if specified). By default, it skips any NaN values.\n",
" Example: df.sum() # Returns the total sum for each numerical column.\n",
"\n",
"* **df.mean():** Computes the mean (average) of values for each column (or row). It also skips NaN values by default.\n",
" Example: df.mean() # Gives the average value for each numerical column.\n",
"\n",
"* **df.count():** Counts the non-NaN entries for each column (or row). This is useful for understanding the amount of valid data in each column.\n",
" Example: df.count() # Returns the number of non-missing values for each column.\n",
"\n",
"* **df.min():** Finds the minimum value in each column (or row). For non-numeric columns, it finds the minimum based on alphabetical order.\n",
" Example: df.min() # Gives the smallest value in each column.\n",
"\n",
"* **df.max():** Identifies the maximum value in each column (or row). For non-numeric columns, it finds the maximum alphabetically.\n",
" Example: df.max() # Returns the largest value in each column."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average Salary by Department:\n",
" Department\n",
"Engineer 100000.0\n",
"HR 57500.0\n",
"IT 77500.0\n",
"Name: Salary, dtype: float64\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Sample DataFrame\n",
"data = {\n",
" 'Department': ['Engineer', 'Engineer', 'HR', 'HR', 'IT', 'IT'],\n",
" 'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],\n",
" 'Salary': [90000, 110000, 60000, 55000, 75000, 80000]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"# Group data by the 'Department' column and calculate the average salary for each department\n",
"grouped_df = df.groupby('Department')['Salary'].mean()\n",
"\n",
"print(\"Average Salary by Department:\\n\", grouped_df)"
]
},
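{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other aggregations work the same way on grouped data. A minimal sketch using ```agg()``` to compute several of them at once:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Apply count, min, and max to each department's salaries in one call\n",
"salary_stats = df.groupby('Department')['Salary'].agg(['count', 'min', 'max'])\n",
"print(salary_stats)"
]
},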
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sum of Sales and Quantity by Product:\n",
" Sales Quantity\n",
"Product \n",
"A 730 51\n",
"B 720 72\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Sample DataFrame\n",
"data = {\n",
" 'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],\n",
" 'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],\n",
" 'Sales': [150, 200, 130, 180, 210, 220, 170, 190],\n",
" 'Quantity': [10, 15, 13, 18, 12, 22, 14, 19]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"# Grouping by 'Product' and calculating the sum for 'Sales' and 'Quantity'\n",
"product_summary = df.groupby('Product')[['Sales', 'Quantity']].sum()\n",
"\n",
"print(\"Sum of Sales and Quantity by Product:\")\n",
"print(product_summary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **pivot_table():** Summarize Data with a Pivot Table\n",
"We can create a pivot table to summarize sales by region and product across months:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Pivot Table (Average Sales by Region and Product):\n",
"Product A B\n",
"Region \n",
"East 170.0 130.0\n",
"North 180.0 NaN\n",
"South 200.0 220.0\n",
"West NaN 185.0\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Sample DataFrame\n",
"data = {\n",
" 'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],\n",
" 'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],\n",
" 'Sales': [150, 200, 130, 180, 210, 220, 170, 190],\n",
" 'Quantity': [10, 15, 13, 18, 12, 22, 14, 19]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"pivot_df = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='mean')\n",
"\n",
"print(\"\\nPivot Table (Average Sales by Region and Product):\")\n",
"print(pivot_df)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Product A B\n",
"Region \n",
"East 170.0 130.0\n",
"North 180.0 0.0\n",
"South 200.0 220.0\n",
"West 0.0 185.0\n"
]
}
],
"source": [
"filled_pivot_df = pivot_df.fillna(value=0)\n",
"\n",
"print(filled_pivot_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**df.melt():** Unpivot from Wide to Long Format\n",
"\n",
"If we want to convert the DataFrame to a long format (one column for \"variable\" and one for \"value\"), we use melt(). For example, to have a column for each attribute (Region, Month, Product, Sales), use:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Melted DataFrame (Unpivoted):\n",
" Region Attribute Value\n",
"0 North Product A\n",
"1 South Product A\n",
"2 East Product B\n",
"3 West Product B\n",
"4 North Product A\n",
"5 South Product B\n",
"6 East Product A\n",
"7 West Product B\n",
"8 North Sales 150\n",
"9 South Sales 200\n",
"10 East Sales 130\n",
"11 West Sales 180\n",
"12 North Sales 210\n",
"13 South Sales 220\n",
"14 East Sales 170\n",
"15 West Sales 190\n"
]
}
],
"source": [
"melted_df = df.melt(id_vars=['Region'], value_vars=['Product', 'Sales'], var_name='Attribute', value_name='Value')\n",
"\n",
"print(\"\\nMelted DataFrame (Unpivoted):\")\n",
"print(melted_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **df.stack() and df.unstack():** Reshape by Stacking and Unstacking\n",
"\n",
"Let's say we have a pivot table and we want to reshape it using stack() and unstack(). For example, starting with pivot_df from above:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Original DataFrame:\n",
"Product A B\n",
"Region \n",
"East 170.0 130.0\n",
"North 180.0 NaN\n",
"South 200.0 220.0\n",
"West NaN 185.0\n",
"\n",
"Stacked DataFrame:\n",
"Region Product\n",
"East A 170.0\n",
" B 130.0\n",
"North A 180.0\n",
"South A 200.0\n",
" B 220.0\n",
"West B 185.0\n",
"dtype: float64\n"
]
}
],
"source": [
"# Stack: Move columns (Product) into the row index level\n",
"stacked_df = pivot_df.stack()\n",
"\n",
"print(\"\\nOriginal DataFrame:\")\n",
"print(pivot_df)\n",
"\n",
"print(\"\\nStacked DataFrame:\")\n",
"print(stacked_df)\n",
"\n"
]
},
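{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the reverse operation, ```unstack()``` moves the inner row index level (Product) back into columns, recovering the original layout:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Unstack: move the inner index level back into columns\n",
"unstacked_df = stacked_df.unstack()\n",
"print(unstacked_df)  # matches pivot_df again, with NaN where no data existed"
]
},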
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Saving Data as a CSV**\n",
"CSV (Comma-Separated Values) is a simple and widely-used text format. Each line in a CSV file represents a row of data, with columns separated by commas.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv('data.csv', index=False) # index=False avoids saving the row indices"
]
}
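,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the file can be read back with ```pd.read_csv()```:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the CSV back into a DataFrame to verify the round trip\n",
"df_loaded = pd.read_csv('data.csv')\n",
"print(df_loaded.head())"
]
}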
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",