Commit 2b0ac20a authored by Jöran Frey

added solution for working with data II

parent 2b856d04
%% Cell type:markdown id:b8ad1699 tags:
# 7 Working with Data II
%% Cell type:markdown id:bac932a5 tags:
## 1 List Comprehensions
A list comprehension is a concise way to create a new list by applying an expression to each element of an existing list or other iterable. The syntax is compact and readable, which keeps list-building code short.
**Syntax**
[**Expression** for **Element** in **Iterable_element** if **Condition**]
- **Expression:** An expression that defines the element in the new list.
- **Element:** An element from the iterable structure that is iterated over.
- **Iterable_element:** An existing list or another structure that is iterated through.
- **Condition** (optional): A condition that determines whether the element is included in the new list.
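As a minimal sketch of how the four parts fit together, the example below uses a made-up list `names` (not part of the exercise files) and keeps only the entries that start with a lowercase letter, converting them to uppercase:

```python
# Hypothetical sample data, used only for this illustration
names = ["ada", "Grace", "linus", "Guido"]

# Expression: name.upper()   Element: name
# Iterable:   names          Condition: name[0].islower()
capitalized = [name.upper() for name in names if name[0].islower()]

print(capitalized)  # ['ADA', 'LINUS']
```

Leaving out the `if` part would simply apply the expression to every element.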
%% Cell type:markdown id:f664b431 tags:
### 1.1 Create a list with the squares of the numbers from 1 to 10
Use a list comprehension to do this.
%% Cell type:code id:527f8b71 tags:
``` python
squares = [x**2 for x in range(1,11,1)]
print(squares)
```
%% Output
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
%% Cell type:markdown id:4be7b5f1 tags:
**Task:** Extend the script so that only the squares of the even numbers are output.
%% Cell type:code id:0aa05216 tags:
``` python
squares = [x**2 for x in range(1, 11, 1) if x%2 == 0]
print(squares)
```
%% Output
[4, 16, 36, 64, 100]
%% Cell type:code id:74199360 tags:
``` python
# without list comprehension
even_squares = []
for x in range(1, 11):
    if x % 2 == 0:
        even_squares.append(x**2)
print(even_squares)
```
%% Output
[4, 16, 36, 64, 100]
%% Cell type:markdown id:e50cf222 tags:
### 1.2 Filtering words
Given a list of words, use a list comprehension to create a new list containing only the words that are longer than 5 letters and end with a vowel. <br>
**TIP:** The length of a word can be determined with the built-in function `len`:
```Python
len('string')
```
%% Cell type:code id:2262f180 tags:
``` python
words= ["banana", "apple", "orange tree", "cherry", "lemon", "melon", "water", "computer", "information", "keyboard"]
```
%% Cell type:code id:1c9fe26e tags:
``` python
wordssorted = [word for word in words if len(word) > 5 and word[-1] in "aeiou"]
print(wordssorted)
```
%% Output
['banana', 'orange tree']
%% Cell type:code id:5b97a72a tags:
``` python
vowels = "aeiouAEIOU"
shortlist = [word for word in words if len(word) > 5 and word[-1] in vowels]
print(shortlist)
```
%% Output
['banana', 'orange tree']
%% Cell type:markdown id:ba7035e8 tags:
### 1.3 Nested list comprehensions
Given a matrix (list of lists) with numbers. Create a flat list of all numbers in the matrix that are divisible by 3. Use a nested list comprehension for this.
%% Cell type:markdown id:a7d65faa tags:
**Outer Loop:** ```for row in matrix```
This goes through each small list (row) inside the main matrix list. So, it will look at each row one by one.
**Inner Loop:** ```for num in row```
For each row, it goes through each num (number) inside that row. This means it’s checking each number in that row one at a time.
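The loop order described above can be sketched by writing the same flattening once as a nested comprehension and once as explicit loops. The small matrix `grid` is a made-up example, not the exercise data:

```python
# Hypothetical 2x3 matrix, used only for this illustration
grid = [
    [1, 2, 3],
    [4, 5, 6],
]

# Nested comprehension: the for-clauses are read left to right, outer loop first
flat = [num for row in grid for num in row]

# Equivalent explicit loops
flat_loop = []
for row in grid:        # outer loop: one row at a time
    for num in row:     # inner loop: each number in that row
        flat_loop.append(num)

print(flat)       # [1, 2, 3, 4, 5, 6]
print(flat_loop)  # [1, 2, 3, 4, 5, 6]
```

A trailing `if` condition, as required by the task, would be appended after the inner `for` clause.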
%% Cell type:code id:e1bc07a0 tags:
``` python
matrix = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
```
%% Cell type:code id:a30b7942 tags:
``` python
numbers = [num for row in matrix for num in row if num % 3 == 0]
print(numbers)
```
%% Output
[3, 6, 9, 12, 15, 18, 21, 24]
%% Cell type:markdown id:c655dd9c tags:
## 2 Data Cleaning with Pandas
%% Cell type:markdown id:9b0139a2 tags:
### 2.1 Remove NaN values
**Task:** Load the file ```10_Word_filtering.csv``` into a DataFrame. Remove every row that has more than one missing value, i.e. keep only the rows with at least four non-NaN entries across the five example columns. Display the resulting DataFrame.
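The `thresh` argument of `DataFrame.dropna` expresses this kind of rule as a minimum number of non-NaN values per row. A minimal sketch, assuming a made-up three-column frame `toy` (not the exercise file) and a hypothetical helper variable `max_missing`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three columns (not the exercise file)
toy = pd.DataFrame({
    "A": ["a1", "a2", np.nan],
    "B": ["b1", np.nan, np.nan],
    "C": ["c1", "c2", "c3"],
})

# Allow at most one missing value per row:
# required non-NaN values = number of columns - allowed missing values
max_missing = 1
thresh = toy.shape[1] - max_missing  # 3 - 1 = 2

kept = toy.dropna(thresh=thresh)
print(kept)  # rows 0 and 1 remain; row 2 has only one non-NaN value
```

With five columns and at most one missing value allowed, the same arithmetic gives `thresh=4`.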
%% Cell type:code id:500bf1b6 tags:
``` python
import pandas as pd
# Load the data
df_profiles = pd.read_csv("./Testdata/10_Word_filtering.csv")
# Drop rows that have fewer than four non-NaN values
filtered_df = df_profiles.dropna(thresh=4)  # keep rows with at least 4 non-NaN values
print("\nFiltered DataFrame:")
print(filtered_df)
```
%% Output
Filtered DataFrame:
    Example 1  Example 2  Example 3  Example 4  Example 5
0       Word1     Word21     Word41        NaN     Word81
1       Word2     Word22     Word42        NaN     Word82
2       Word3     Word23     Word43     Word63        NaN
5       Word6     Word26     Word46     Word66        NaN
8       Word9     Word29     Word49     Word69     Word89
9      Word10     Word30     Word50        NaN     Word90
10     Word11        NaN     Word51     Word71     Word91
11     Word12     Word32     Word52     Word72     Word92
12     Word13     Word33     Word53     Word73        NaN
13     Word14     Word34        NaN     Word74     Word94
16     Word19     Word39     Word59        NaN     Word99
%% Cell type:markdown id:b864d54d tags:
### 2.2 Analyze Sales Data by Product and Region
%% Cell type:markdown id:ffb536ed tags:
Given a sales dataset ```20_sales_data_2024.csv``` for a fictional company, analyze
* **1. Total Sales by Product**
* **2. Average Sales by Region**
* **3. Top 3 Best-Selling Products Overall**
* **4. Highest Revenue Region for Each Product**
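For the fourth point, one possible approach is to sum the revenue per product and region and then pick the largest sum within each product using `idxmax`. The sketch below uses a made-up mini dataset `toy_sales`. The column names `Product`, `Region` and `Sales Amount` are assumed to match the exercise file:

```python
import pandas as pd

# Hypothetical mini dataset; column names assumed to match the exercise file
toy_sales = pd.DataFrame({
    "Product": ["Product A", "Product A", "Product A", "Product B", "Product B"],
    "Region": ["Asia", "Europe", "Asia", "Asia", "Europe"],
    "Sales Amount": [100.0, 250.0, 200.0, 50.0, 75.0],
})

# Total revenue per (Product, Region) pair
revenue = toy_sales.groupby(["Product", "Region"])["Sales Amount"].sum().reset_index()

# idxmax gives, for each product, the row label of its largest revenue
best_rows = revenue.groupby("Product")["Sales Amount"].idxmax()
best_region = revenue.loc[best_rows].reset_index(drop=True)

print(best_region)
```

Sorting by revenue and keeping the first row per product leads to the same result. `idxmax` just avoids the explicit sort.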
%% Cell type:code id:ca1e2c11 tags:
``` python
import pandas as pd
df_sales = pd.read_csv("./Testdata/20_sales_data_2024.csv")
#look at dataframe
print(df_sales.head(5))
# 1. Total Sales by Product
total_sales_by_product = df_sales.groupby('Product')['Sales Amount'].sum().reset_index()
print("Total Sales by Product:")
print(total_sales_by_product)
# 2. Average Sales by Region
average_sales_by_region = df_sales.groupby('Region')['Sales Amount'].mean().reset_index()
print("\nAverage Sales by Region:")
print(average_sales_by_region)
# 3. Top 3 Best-Selling Products Overall
top_3_products = total_sales_by_product.sort_values(by='Sales Amount', ascending=False).head(3)
print("\nTop 3 Best-Selling Products Overall:")
print(top_3_products)
# 4. Highest Revenue Region for every Product
highest_revenue_region_by_product = df_sales.groupby(['Product', 'Region'])['Sales Amount'].sum().reset_index()
highest_revenue_region_by_product = highest_revenue_region_by_product.sort_values(['Product', 'Sales Amount'], ascending=[True, False])
highest_revenue_region_by_product = highest_revenue_region_by_product.drop_duplicates(subset=['Product'], keep='first')
print("\nHighest Revenue Region for Each Product:")
print(highest_revenue_region_by_product)
```
%% Output
         Date    Product         Region  Sales Amount  Quantity Sold
0  2024-04-12  Product D           Asia        743.12              3
1  2024-12-14  Product E  South America       1420.23             11
2  2024-09-27  Product C         Africa       4859.09             53
3  2024-04-16  Product C  North America       1690.17             46
4  2024-03-12  Product A  North America       2436.10             30
Total Sales by Product:
     Product  Sales Amount
0  Product A     510064.23
1  Product B     492452.92
2  Product C     496559.03
3  Product D     470472.52
4  Product E     519674.26

Average Sales by Region:
          Region  Sales Amount
0         Africa   2572.376368
1           Asia   2171.946957
2         Europe   2605.275885
3  North America   2672.068906
4  South America   2421.155281

Top 3 Best-Selling Products Overall:
     Product  Sales Amount
4  Product E     519674.26
0  Product A     510064.23
2  Product C     496559.03

Highest Revenue Region for Each Product:
      Product         Region  Sales Amount
0   Product A         Africa     127460.48
9   Product B  South America     151319.93
10  Product C         Africa     126428.13
18  Product D  North America     118419.80
23  Product E  North America     124364.07
Example 1,Example 2,Example 3,Example 4,Example 5
Word1,Word21,Word41,,Word81
Word2,Word22,Word42,,Word82
Word3,Word23,Word43,Word63,
Word4,Word24,,,Word84
Word5,,Word45,,Word85
Word6,Word26,Word46,Word66,
Word7,,,Word67,Word87
Word8,Word28,,,
Word9,Word29,Word49,Word69,Word89
Word10,Word30,Word50,,Word90
Word11,,Word51,Word71,Word91
Word12,Word32,Word52,Word72,Word92
Word13,Word33,Word53,Word73,
Word14,Word34,,Word74,Word94
,Word37,Word57,,Word97
Word18,Word38,,,
Word19,Word39,Word59,,Word99
Word20,Word40,,,
import pandas as pd
import numpy as np
# Define parameters for dataset creation
np.random.seed(42)
num_records = 1000 # Number of records
# Generate random dates within the year 2024
date_range = pd.date_range(start="2024-01-01", end="2024-12-31", freq='D')
dates = np.random.choice(date_range, num_records)
# Define product names and regions
products = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
regions = ['North America', 'Europe', 'Asia', 'South America', 'Africa']
# Generate random data
product_choices = np.random.choice(products, num_records)
region_choices = np.random.choice(regions, num_records)
sales_amount = np.round(np.random.uniform(50, 5000, num_records), 2)  # two decimals, as in the CSV
quantity_sold = np.random.randint(1, 100, num_records)
# Create the DataFrame
sales_data = pd.DataFrame({
    'Date': dates,
    'Product': product_choices,
    'Region': region_choices,
    'Sales Amount': sales_amount,
    'Quantity Sold': quantity_sold
})
print(sales_data)
# Save the DataFrame to a CSV file
sales_data.to_csv("./20_sales_data_2024.csv", index=False)
print("Dataset saved as '20_sales_data_2024.csv'")
unit-6/images/anatomyofplot.webp

64 KiB

%% Cell type:markdown id: tags:
# Matplotlib
%% Cell type:markdown id: tags:
## Agenda
* Introduction to Matplotlib
* Elements of a Figure
* Of Figures and Plots
%% Cell type:markdown id: tags:
## **Matplotlib**
* **What is Matplotlib?**
  * **A Powerful Visualization Library:** Matplotlib is a core Python library that allows you to create a wide range of data visualizations, from simple plots to complex graphics.
* **Why Use Matplotlib?**
  * **Highly Customizable:** Offers flexibility to adjust colors, styles, and layouts.
  * **Works Seamlessly with Other Libraries:** Integrates well with data tools such as pandas and NumPy.
  * **Industry Standard:** Widely used in data science, research, and engineering, making it a valuable skill in many technical fields.
%% Cell type:markdown id: tags:
<img src="./images/anatomyofplot.webp" width="700"/>
%% Cell type:markdown id: tags:
### Introduction to Matplotlib
**What is Matplotlib?**
* Matplotlib is a popular Python library for creating static, interactive, and animated plots and visualizations.
* It is widely used for data analysis, making it easy to understand data through visual representation.

**Why Use Matplotlib?**
* It helps turn complex data into charts, making patterns and trends easier to see.
* It is an essential tool in data science, machine learning, and research.
%% Cell type:code id: tags:
``` python
# https://matplotlib.org/
```