{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Working with Data II"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Agenda\n",
    "\n",
    "* List Comprehensions\n",
    "* Data Cleaning with Pandas\n",
    "  * Empty Cells\n",
    "  * Wrong Format\n",
    "  * Wrong Data\n",
    "  * Removing Duplicates\n",
    "* Modifying Data\n",
    "  * Column operations\n",
    "  * Aggregation and Grouping\n",
    "* Pivoting and Reshaping\n",
    "* Storing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## **List Comprehensions**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **What are list comprehensions?**\n",
    "  * List comprehensions provide a concise way to create lists in Python.\n",
    "  * They let you build a list with conditional content in a **single line of code**.\n",
    "  * The syntax is easy to read."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **Why do we use list comprehensions?**\n",
    "  \n",
    "  * **Cleaner Code**: They reduce the amount of code needed to create or transform a list.\n",
    "  * **Enhanced Readability**: When used properly, list comprehensions can make your code more **readable** and **expressive**.\n",
    "  * **Flexibility**: They support **conditional logic** and **nested loops**, making them highly versatile."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how this works with an example.\n",
    "\n",
    "Given a list of fruits, create a new list containing only the fruits that have the letter 'a' in their name.\n",
    "\n",
    "First, let's see how we would solve this using conventional syntax:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['apple', 'banana', 'mango']\n"
     ]
    }
   ],
   "source": [
    "fruits = [\"apple\", \"banana\", \"cherry\", \"kiwi\", \"mango\"] #given list\n",
    "\n",
    "newlist = [] #1. create new list\n",
    "\n",
    "for x in fruits: #2. iterate through the list\n",
    "  if \"a\" in x: #3. check whether the letter 'a' is in the word\n",
    "    newlist.append(x) #4. add the fruit to the new list\n",
    "\n",
    "print(newlist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a list comprehension, you can do all of that in a single line of code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['apple', 'banana', 'mango']\n"
     ]
    }
   ],
   "source": [
    "fruits = [\"apple\", \"banana\", \"cherry\", \"kiwi\", \"mango\"]\n",
    "\n",
    "newlist =  [fruit for fruit in fruits if \"a\" in fruit] #the entire logic in one line\n",
    "\n",
    "print(newlist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this single line, we created a new list, iterated through fruits, and selected only the items that contain the letter 'a'.\n",
    "\n",
    "Unbelievable!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In list comprehensions, these three components work together as follows:\n",
    "\n",
    " * **Output:** Defines what each element in the new list will look like (often just the element itself or a transformed version).\n",
    " * **For loop with an iterable element:** Iterates over each item in the original list or iterable, bringing each item into the comprehension for processing.\n",
    " * **Condition and expression (optional):** Filters the items to include in the new list based on a condition, or applies an expression to transform each item."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![List comprehension example](./Pictures/Listcomprehension.PNG)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['APPLE', 'BANANA', 'CHERRY', 'KIWI', 'MANGO']\n"
     ]
    }
   ],
   "source": [
    "newlist = [x.upper() for x in fruits]\n",
    "print(newlist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The expression can also contain a condition, not as a filter, but as a way to transform the outcome:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original fruit basket: ['apple', 'banana', 'cherry', 'kiwi', 'mango']\n",
      "Updated fruit basket: ['apple', 'banana vanished into thin air 🍌💨', 'cherry', 'kiwi', 'mango']\n"
     ]
    }
   ],
   "source": [
    "fruits = [\"apple\", \"banana\", \"cherry\", \"kiwi\", \"mango\"]\n",
    "\n",
    "print(\"Original fruit basket:\", fruits)\n",
    "newlist = [fruit if fruit != \"banana\" else \"banana vanished into thin air 🍌💨\" for fruit in fruits]\n",
    "\n",
    "print(\"Updated fruit basket:\", newlist)"
   ]
  },
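  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comprehensions also support the **nested loops** mentioned above. A minimal sketch, using a made-up nested list: the `for` clauses read left to right, with the outer loop first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "baskets = [[\"apple\", \"banana\"], [\"cherry\"], [\"kiwi\", \"mango\"]] # made-up nested list\n",
    "\n",
    "flat = [fruit for basket in baskets for fruit in basket] # outer loop first, then inner\n",
    "\n",
    "print(flat) # ['apple', 'banana', 'cherry', 'kiwi', 'mango']"
   ]
  },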
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Data Cleaning with Pandas**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **What is Data Cleaning?**\n",
    "    * Data Cleaning is the process of transforming messy data into a structured, consistent format.\n",
    "    * Using Pandas, we can handle common issues like missing values, duplicates, and incorrect data types efficiently.\n",
    "    * Pandas offers simple and powerful methods to streamline this process, ensuring data accuracy for analysis and modeling.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Remove Missing Data**\n",
    "\n",
    "The ```dropna()``` method removes missing data, i.e. NaN (Not a Number) values.\n",
    "* **Basic Usage:** ```dropna()``` removes rows (or columns) with any NaN values by default.\n",
    "\n",
    "* The method supports the following parameters:\n",
    "    * **Axis Parameter:** Specify axis=0 to drop rows with NaN values (default) or axis=1 to drop columns instead.\n",
    "    * **Threshold Parameter:** Use thresh to set a minimum number of non-NaN values required to keep a row or column.\n",
    "    * **Subset Parameter:** Define specific columns to check for NaNs with the subset parameter, allowing for selective cleaning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original DataFrame:\n",
      "   Customer   Age         City  Purchase_Amount\n",
      "0    Alice  24.0     New York             85.0\n",
      "1      Bob   NaN  Los Angeles              NaN\n",
      "2     None  35.0         None            120.0\n",
      "3    David   NaN      Chicago             92.0\n",
      "4      Eve  28.0         None              NaN\n",
      "\n",
      "Rows with any missing values removed:\n",
      "   Customer   Age      City  Purchase_Amount\n",
      "0    Alice  24.0  New York             85.0\n",
      "\n",
      "Columns with any missing values removed:\n",
      " Empty DataFrame\n",
      "Columns: []\n",
      "Index: [0, 1, 2, 3, 4]\n",
      "\n",
      "Rows with at least 3 non-missing values:\n",
      "   Customer   Age      City  Purchase_Amount\n",
      "0    Alice  24.0  New York             85.0\n",
      "3    David   NaN   Chicago             92.0\n",
      "\n",
      "Rows removed only if 'Customer' or 'City' has missing values:\n",
      "   Customer   Age         City  Purchase_Amount\n",
      "0    Alice  24.0     New York             85.0\n",
      "1      Bob   NaN  Los Angeles              NaN\n",
      "3    David   NaN      Chicago             92.0\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample DataFrame with  missing values\n",
    "data = {\n",
    "    'Customer': ['Alice', 'Bob', None, 'David', 'Eve'],\n",
    "    'Age': [24, None, 35, None, 28],\n",
    "    'City': ['New York', 'Los Angeles', None, 'Chicago', None],\n",
    "    'Purchase_Amount': [85, None, 120, 92, None]\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "print(\"Original DataFrame:\\n\", df)\n",
    "\n",
    "# 1. Basic Usage: Drop rows where any NaN values are present\n",
    "df_dropped_any = df.dropna(axis=0)\n",
    "print(\"\\nRows with any missing values removed:\\n\", df_dropped_any)\n",
    "\n",
    "# 2. Axis Parameter: Drop columns with any NaN values\n",
    "df_dropped_columns = df.dropna(axis=1)\n",
    "print(\"\\nColumns with any missing values removed:\\n\", df_dropped_columns)\n",
    "\n",
    "# 3. Threshold Parameter: Keep rows with at least 3 non-NaN values\n",
    "df_thresh = df.dropna(thresh=3)\n",
    "print(\"\\nRows with at least 3 non-missing values:\\n\", df_thresh)\n",
    "\n",
    "# 4. Subset Parameter: Only check the 'Customer' and 'City' columns for missing values\n",
    "df_subset = df.dropna(subset=['Customer', 'City'])\n",
    "print(\"\\nRows removed only if 'Customer' or 'City' has missing values:\\n\", df_subset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notes: Data Cleaning with Pandas\n",
    "\n",
    "* ```df.fillna()```: Fill missing values with a specified value.\n",
    "* ```df.drop()```: Drop rows or columns.\n",
    "* ```df.duplicated()```, ```df.drop_duplicates()```: Find and remove duplicate rows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Fill Missing Data**\n",
    "The ```fillna()``` method replaces missing data (NaN values) with specified values.\n",
    "\n",
    "* The method supports the following parameters:\n",
    "    * **Value Parameter:** Define a specific value to fill NaNs with, such as ```fillna(value=0)```, or ```fillna(value=df.mean())``` for mean imputation.\n",
    "    * **Limit Parameter:** Use ```limit``` to cap how many consecutive NaN values are filled.\n",
    "    * **Inplace Parameter:** Set ```inplace=True``` to modify the DataFrame directly instead of returning a new one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original DataFrame:\n",
      "      A    B    C\n",
      "0  1.0  NaN  5.0\n",
      "1  NaN  2.0  NaN\n",
      "2  3.0  NaN  NaN\n",
      "3  NaN  4.0  8.0\n",
      "4  5.0  NaN  9.0\n",
      "\n",
      "Filled with 0:\n",
      "      A    B    C\n",
      "0  1.0  0.0  5.0\n",
      "1  0.0  2.0  0.0\n",
      "2  3.0  0.0  0.0\n",
      "3  0.0  4.0  8.0\n",
      "4  5.0  0.0  9.0\n",
      "\n",
      "Filled with column mean:\n",
      "      A    B         C\n",
      "0  1.0  3.0  5.000000\n",
      "1  3.0  2.0  7.333333\n",
      "2  3.0  3.0  7.333333\n",
      "3  3.0  4.0  8.000000\n",
      "4  5.0  3.0  9.000000\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample DataFrame with missing values\n",
    "data = {\n",
    "    'A': [1, None, 3, None, 5],\n",
    "    'B': [None, 2, None, 4, None],\n",
    "    'C': [5, None, None, 8, 9]\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# 1. Basic Value Parameter: Fill NaNs with a specified value\n",
    "df_filled_value = df.fillna(value=0)\n",
    "\n",
    "# 2. Using the Mean of Each Column: Fill NaNs with column means\n",
    "df_filled_mean = df.fillna(value=df.mean())\n",
    "\n",
    "\n",
    "print(\"Original DataFrame:\\n\", df)\n",
    "print(\"\\nFilled with 0:\\n\", df_filled_value)\n",
    "print(\"\\nFilled with column mean:\\n\", df_filled_mean)"
   ]
  },
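  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides a fixed value or the column mean, missing data can also be filled from neighbouring rows. A small sketch (not covered on the slide) using the pandas methods ```ffill()``` and ```bfill()``` on the same ```df```:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Forward fill: propagate the last valid value downwards (NaNs at the top remain)\n",
    "df_ffill = df.ffill()\n",
    "\n",
    "# Backward fill: pull the next valid value upwards (NaNs at the bottom remain)\n",
    "df_bfill = df.bfill()\n",
    "\n",
    "print(\"Forward filled:\\n\", df_ffill)\n",
    "print(\"\\nBackward filled:\\n\", df_bfill)"
   ]
  },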
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Delete Rows and Columns**\n",
    "\n",
    "With the ```df.drop()``` method you can remove specified rows or columns from a DataFrame.\n",
    "* The method supports the following parameters:\n",
    "    * **Labels Parameter:** Specify row or column labels to remove, e.g. ```labels=['Column1']```\n",
    "    * **Axis Parameter:** Set axis=0 to drop rows (default) or axis=1 to drop columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original DataFrame:\n",
      "       Name  Age        City\n",
      "0    Alice   24         Zug\n",
      "1      Bob   19        Züri\n",
      "2     Elmo   24  St. Gallen\n",
      "3  Charlie   23        Bern\n",
      "4      Bob   19        Biel\n",
      "\n",
      "DataFrame after dropping row with index 1 (Bob):\n",
      "       Name  Age        City\n",
      "0    Alice   24         Zug\n",
      "2     Elmo   24  St. Gallen\n",
      "3  Charlie   23        Bern\n",
      "4      Bob   19        Biel\n",
      "\n",
      "DataFrame after dropping the 'City' column:\n",
      "       Name  Age\n",
      "0    Alice   24\n",
      "1      Bob   19\n",
      "2     Elmo   24\n",
      "3  Charlie   23\n",
      "4      Bob   19\n"
     ]
    }
   ],
   "source": [
    "data = {\n",
    "    'Name': ['Alice', 'Bob', 'Elmo', 'Charlie', 'Bob'],\n",
    "    'Age': [24, 19, 24, 23, 19],\n",
    "    'City': ['Zug', 'Züri', 'St. Gallen', 'Bern','Biel']\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "print(\"Original DataFrame:\\n\", df)\n",
    "\n",
    "# 1. Remove a specific row by label (e.g., drop the row at index 1, which is 'Bob')\n",
    "df_dropped_row = df.drop(labels=[1], axis=0)\n",
    "print(\"\\nDataFrame after dropping row with index 1 (Bob):\\n\", df_dropped_row)\n",
    "\n",
    "# 2. Remove a specific column by label (e.g., drop the 'City' column)\n",
    "df_dropped_column = df.drop(labels=['City'], axis=1)\n",
    "print(\"\\nDataFrame after dropping the 'City' column:\\n\", df_dropped_column)"
   ]
  },
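  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```drop()``` works on labels, not conditions. To remove rows matching a condition, one common pattern (a sketch, not from the slide) is to pass the matching index labels to ```drop()```:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select the index labels of all rows where Age is below 20, then drop them\n",
    "under_20 = df[df['Age'] < 20].index\n",
    "df_adults = df.drop(labels=under_20, axis=0)\n",
    "\n",
    "print(df_adults)"
   ]
  },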
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Delete Duplicates**\n",
    "\n",
    "The Pandas method to remove duplicates is ```drop_duplicates()```. \n",
    "Parameters:\n",
    "* Subset: Specify columns to check for duplicates, e.g., subset=['Column1', 'Column2'].\n",
    "* Keep: Defines which duplicate to keep:\n",
    "    * first (default): Keeps the first occurrence of each duplicate.\n",
    "    * last: Keeps the last occurrence.\n",
    "    * False: Removes all duplicates.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      Name  Age\n",
      "0    Alice   24\n",
      "1      Bob   19\n",
      "3  Charlie   24\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample DataFrame with Alice duplicated\n",
    "data = {\n",
    "    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],\n",
    "    'Age': [24, 19, 24, 24, 19]\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Remove duplicates based on the 'Name' column, keeping the first occurrence\n",
    "df_unique = df.drop_duplicates(subset=['Name'], keep='first')\n",
    "\n",
    "print(df_unique)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**In-Place:** By default, ```drop_duplicates()``` returns a new DataFrame. Set ```inplace=True``` to modify the original DataFrame directly."
   ]
  },
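  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of the ```keep``` options on the same ```df```:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# keep=False removes every row whose 'Name' occurs more than once\n",
    "print(df.drop_duplicates(subset=['Name'], keep=False))\n",
    "\n",
    "# keep='last' keeps the last occurrence of each duplicate instead of the first\n",
    "print(df.drop_duplicates(subset=['Name'], keep='last'))"
   ]
  },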
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Group Data by Columns**\n",
    "With the ```df.groupby()``` method: You can group data by one or more columns, allowing you to perform aggregate operations on each group separately.\n",
    "\n",
    "* **Basic Usage:** groupby() groups data based on the values of specified columns and can then apply functions like sum(), mean(), count(), etc., to each group.\n",
    "\n",
    "### **Aggregations**\n",
    "\n",
    "Aggregations in Pandas are functions that summarize or compute a single result from multiple values within a DataFrame or Series.\n",
    "\n",
    "* **df.sum():** Calculates the sum of values for each column (or row if specified). By default, it skips any NaN values.\n",
    "  Example: ```df.sum()``` returns the total sum for each numerical column.\n",
    "\n",
    "* **df.mean():** Computes the mean (average) of values for each column (or row). It also skips NaN values by default.\n",
    "  Example: ```df.mean()``` gives the average value for each numerical column.\n",
    "\n",
    "* **df.count():** Counts the non-NaN entries for each column (or row). This is useful for understanding the amount of valid data in each column.\n",
    "  Example: ```df.count()``` returns the number of non-missing values for each column.\n",
    "\n",
    "* **df.min():** Finds the minimum value in each column (or row). For non-numeric columns, it finds the minimum based on alphabetical order.\n",
    "  Example: ```df.min()``` gives the smallest value in each column.\n",
    "\n",
    "* **df.max():** Identifies the maximum value in each column (or row). For non-numeric columns, it finds the maximum alphabetically.\n",
    "  Example: ```df.max()``` returns the largest value in each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Salary by Department:\n",
      " Department\n",
      "Engineer    100000.0\n",
      "HR           57500.0\n",
      "IT           77500.0\n",
      "Name: Salary, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample DataFrame\n",
    "data = {\n",
    "    'Department': ['Engineer', 'Engineer', 'HR', 'HR', 'IT', 'IT'],\n",
    "    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],\n",
    "    'Salary': [90000, 110000, 60000, 55000, 75000, 80000]\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Group data by the 'Department' column and calculate the average salary for each department\n",
    "grouped_df = df.groupby('Department')['Salary'].mean()\n",
    "\n",
    "print(\"Average Salary by Department:\\n\", grouped_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sum of Sales and Quantity by Product:\n",
      "         Sales  Quantity\n",
      "Product                 \n",
      "A          730        51\n",
      "B          720        72\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample DataFrame\n",
    "data = {\n",
    "    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],\n",
    "    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],\n",
    "    'Sales': [150, 200, 130, 180, 210, 220, 170, 190],\n",
    "    'Quantity': [10, 15, 13, 18, 12, 22, 14, 19]\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Grouping by 'Product' and calculating the sum for 'Sales' and 'Quantity'\n",
    "product_summary = df.groupby('Product')[['Sales', 'Quantity']].sum()\n",
    "\n",
    "print(\"Sum of Sales and Quantity by Product:\")\n",
    "print(product_summary)"
   ]
  },
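  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A group can also be summarized with several aggregations at once using ```.agg()``` (a sketch on the same ```df```):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply several aggregation functions to 'Sales' in a single call\n",
    "sales_summary = df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])\n",
    "\n",
    "print(sales_summary)"
   ]
  },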
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **pivot_table():** Summarize Data with a Pivot Table\n",
    "We can create a pivot table to summarize average sales by region and product:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Pivot Table (Average Sales by Region and Product):\n",
      "Product      A      B\n",
      "Region               \n",
      "East     170.0  130.0\n",
      "North    180.0    NaN\n",
      "South    200.0  220.0\n",
      "West       NaN  185.0\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample DataFrame\n",
    "data = {\n",
    "    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],\n",
    "    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],\n",
    "    'Sales': [150, 200, 130, 180, 210, 220, 170, 190],\n",
    "    'Quantity': [10, 15, 13, 18, 12, 22, 14, 19]\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "pivot_df = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='mean')\n",
    "\n",
    "print(\"\\nPivot Table (Average Sales by Region and Product):\")\n",
    "print(pivot_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Product      A      B\n",
      "Region               \n",
      "East     170.0  130.0\n",
      "North    180.0    0.0\n",
      "South    200.0  220.0\n",
      "West       0.0  185.0\n"
     ]
    }
   ],
   "source": [
    "filled_pivot_df = pivot_df.fillna(value=0)\n",
    "\n",
    "print(filled_pivot_df)"
   ]
  },
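  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```pivot_table()``` can also append row and column totals via the ```margins``` parameter (a sketch; pandas labels the totals 'All' by default):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# margins=True adds an 'All' row and column containing the grand totals\n",
    "pivot_totals = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', margins=True)\n",
    "\n",
    "print(pivot_totals)"
   ]
  },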
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **df.melt():** Unpivot from Wide to Long Format\n",
    "\n",
    "If we want to convert the DataFrame to a long format (one column for \"variable\" and one for \"value\"), we use melt(). For example, to keep Region as an identifier and melt the Product and Sales columns into attribute/value pairs, use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Melted DataFrame (Unpivoted):\n",
      "   Region Attribute Value\n",
      "0   North   Product     A\n",
      "1   South   Product     A\n",
      "2    East   Product     B\n",
      "3    West   Product     B\n",
      "4   North   Product     A\n",
      "5   South   Product     B\n",
      "6    East   Product     A\n",
      "7    West   Product     B\n",
      "8   North     Sales   150\n",
      "9   South     Sales   200\n",
      "10   East     Sales   130\n",
      "11   West     Sales   180\n",
      "12  North     Sales   210\n",
      "13  South     Sales   220\n",
      "14   East     Sales   170\n",
      "15   West     Sales   190\n"
     ]
    }
   ],
   "source": [
    "melted_df = df.melt(id_vars=['Region'], value_vars=['Product', 'Sales'], var_name='Attribute', value_name='Value')\n",
    "\n",
    "print(\"\\nMelted DataFrame (Unpivoted):\")\n",
    "print(melted_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **df.stack() and df.unstack():** Reshape by Stacking and Unstacking\n",
    "\n",
    "Let's say we have a pivot table and we want to reshape it using stack() and unstack(). For example, starting with pivot_df from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Original DataFrame:\n",
      "Product      A      B\n",
      "Region               \n",
      "East     170.0  130.0\n",
      "North    180.0    NaN\n",
      "South    200.0  220.0\n",
      "West       NaN  185.0\n",
      "\n",
      "Stacked DataFrame:\n",
      "Region  Product\n",
      "East    A          170.0\n",
      "        B          130.0\n",
      "North   A          180.0\n",
      "South   A          200.0\n",
      "        B          220.0\n",
      "West    B          185.0\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# Stack: Move columns (Product) into the row index level\n",
    "stacked_df = pivot_df.stack()\n",
    "\n",
    "print(\"\\nOriginal DataFrame:\")\n",
    "print(pivot_df)\n",
    "\n",
    "print(\"\\nStacked DataFrame:\")\n",
    "print(stacked_df)\n",
    "\n"
   ]
  },
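  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```unstack()``` is the inverse operation: it moves the inner index level back into columns, recovering the pivoted shape (missing combinations reappear as NaN):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Unstack: move the inner index level (Product) back into columns\n",
    "unstacked_df = stacked_df.unstack()\n",
    "\n",
    "print(unstacked_df)"
   ]
  },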
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Saving Data as a CSV**\n",
    "CSV (Comma-Separated Values) is a simple and widely-used text format. Each line in a CSV file represents a row of data, with columns separated by commas.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv('data.csv', index=False)  # index=False avoids saving the row indices"
   ]
  }
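  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the export, the file can be read back with ```pd.read_csv()```:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_loaded = pd.read_csv('data.csv') # read the CSV back into a DataFrame\n",
    "print(df_loaded.head())"
   ]
  }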
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}