diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..87620ac --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +.ipynb_checkpoints/ diff --git a/lab-sql-python-connection.ipynb b/lab-sql-python-connection.ipynb new file mode 100644 index 0000000..35bdb67 --- /dev/null +++ b/lab-sql-python-connection.ipynb @@ -0,0 +1,616 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "eccf68ee", + "metadata": {}, + "source": [ + "# lab-sql-python-connection" + ] + }, + { + "cell_type": "markdown", + "id": "59e00018", + "metadata": {}, + "source": [ + "## 1. Import Libraries\n", + "\n", + "We will use:\n", + "\n", + "- `pandas` to work with DataFrames.\n", + "- `sqlalchemy` to create the connection engine.\n", + "- `text` from SQLAlchemy to safely write SQL queries with parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5ad93640", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from sqlalchemy import create_engine, text" + ] + }, + { + "cell_type": "markdown", + "id": "4c0b813b", + "metadata": {}, + "source": [ + "## 2. Create the Database Connection\n", + "\n", + "Here we create the connection between Python and the Sakila database.\n", + "\n", + "> Replace `your_password` with your own MySQL password." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "96da5463", + "metadata": {}, + "outputs": [], + "source": [ + "# Database connection settings\n", + "password = \"your_password\" # Replace with your MySQL password\n", + "db = \"sakila\"\n", + "\n", + "# Create the connection string\n", + "connection_string = f\"mysql+pymysql://root:{password}@localhost/{db}\"\n", + "\n", + "# Create the engine\n", + "engine = create_engine(connection_string)" + ] + }, + { + "cell_type": "markdown", + "id": "81d390f6", + "metadata": {}, + "source": [ + "### Test the connection\n", + "\n", + "Before moving forward, it is useful to test if the connection works correctly." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ab16ebf", + "metadata": {}, + "outputs": [], + "source": [ + "# Test connection\n", + "query = text(\"SELECT * FROM rental LIMIT 5;\")\n", + "\n", + "sample_rentals = pd.read_sql(query, engine)\n", + "sample_rentals" + ] + }, + { + "cell_type": "markdown", + "id": "b02b4c0a", + "metadata": {}, + "source": [ + "## 3. Function 1: Retrieve Rentals by Month\n", + "\n", + "The function `rentals_month()` retrieves all rental records for a specific month and year.\n", + "\n", + "It receives three parameters:\n", + "\n", + "- `engine`: the database connection engine.\n", + "- `month`: the month we want to analyze.\n", + "- `year`: the year we want to analyze.\n", + "\n", + "The function returns a Pandas DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4232b247", + "metadata": {}, + "outputs": [], + "source": [ + "def rentals_month(engine, month, year):\n", + " \"\"\"\n", + " Retrieves rental data for a specific month and year from the Sakila database.\n", + " \n", + " Parameters:\n", + " engine: SQLAlchemy engine used to connect to the database.\n", + " month: Integer representing the month.\n", + " year: Integer representing the year.\n", + " \n", + " Returns:\n", + " A pandas DataFrame with rental data for the selected month and year.\n", + " \"\"\"\n", + " \n", + " query = text(\"\"\"\n", + " SELECT *\n", + " FROM rental\n", + " WHERE MONTH(rental_date) = :month\n", + " AND YEAR(rental_date) = :year;\n", + " \"\"\")\n", + " \n", + " df = pd.read_sql(query, engine, params={\"month\": month, \"year\": year})\n", + " \n", + " return df" + ] + }, + { + "cell_type": "markdown", + "id": "30bc8a27", + "metadata": {}, + "source": [ + "## 4. Retrieve May and June 2005 Rentals\n", + "\n", + "According to the challenge, we need to analyze customers who were active in both May and June." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6328647f", + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve rental data for May and June 2005\n", + "rentals_may = rentals_month(engine, 5, 2005)\n", + "rentals_june = rentals_month(engine, 6, 2005)\n", + "\n", + "print(\"May rentals shape:\", rentals_may.shape)\n", + "print(\"June rentals shape:\", rentals_june.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "31b3194e", + "metadata": {}, + "source": [ + "### Preview May rentals" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2e46dba7", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_may.head()" + ] + }, + { + "cell_type": "markdown", + "id": "98a67688", + "metadata": {}, + "source": [ + "### Preview June rentals" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3f09fd2", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_june.head()" + ] + }, + { + "cell_type": "markdown", + "id": "f210e167", + "metadata": {}, + "source": [ + "## 5. Basic Exploration\n", + "\n", + "Before comparing customers, we can quickly check how many rental transactions happened in each month and how many unique customers were active." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "301d370d", + "metadata": {}, + "outputs": [], + "source": [ + "may_total_rentals = rentals_may.shape[0]\n", + "june_total_rentals = rentals_june.shape[0]\n", + "\n", + "may_unique_customers = rentals_may[\"customer_id\"].nunique()\n", + "june_unique_customers = rentals_june[\"customer_id\"].nunique()\n", + "\n", + "summary = pd.DataFrame({\n", + " \"month\": [\"May 2005\", \"June 2005\"],\n", + " \"total_rentals\": [may_total_rentals, june_total_rentals],\n", + " \"unique_customers\": [may_unique_customers, june_unique_customers]\n", + "})\n", + "\n", + "summary" + ] + }, + { + "cell_type": "markdown", + "id": "a3327298", + "metadata": {}, + "source": [ + "### Initial Insight\n", + "\n", + "This summary helps us understand the overall activity in each month before going into the customer-level comparison.\n", + "\n", + "If June has more rentals or more active customers than May, it may suggest an increase in customer activity." + ] + }, + { + "cell_type": "markdown", + "id": "3cc96811", + "metadata": {}, + "source": [ + "## 6. 
Function 2: Count Rentals by Customer and Month\n", + "\n", + "The function `rental_count_month()` receives the rental DataFrame for one month and counts how many rentals each customer made.\n", + "\n", + "The new column name is created dynamically using the month and year.\n", + "\n", + "For example:\n", + "\n", + "- May 2005 → `rentals_05_2005`\n", + "- June 2005 → `rentals_06_2005`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7aaa8e7c", + "metadata": {}, + "outputs": [], + "source": [ + "def rental_count_month(df, month, year):\n", + " \"\"\"\n", + " Counts the number of rentals made by each customer during a selected month and year.\n", + " \n", + " Parameters:\n", + " df: DataFrame returned by rentals_month().\n", + " month: Integer representing the month.\n", + " year: Integer representing the year.\n", + " \n", + " Returns:\n", + " A DataFrame with customer_id and the number of rentals for that month.\n", + " \"\"\"\n", + " \n", + " column_name = f\"rentals_{month:02d}_{year}\"\n", + " \n", + " rental_count = (\n", + " df.groupby(\"customer_id\")\n", + " .size()\n", + " .reset_index(name=column_name)\n", + " )\n", + " \n", + " return rental_count" + ] + }, + { + "cell_type": "markdown", + "id": "b266b031", + "metadata": {}, + "source": [ + "## 7. Count Rentals for May and June" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee18607e", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_may_count = rental_count_month(rentals_may, 5, 2005)\n", + "rentals_june_count = rental_count_month(rentals_june, 6, 2005)\n", + "\n", + "rentals_may_count.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b1baf9c", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_june_count.head()" + ] + }, + { + "cell_type": "markdown", + "id": "e46ea476", + "metadata": {}, + "source": [ + "## 8. 
Function 3: Compare Rentals Between Two Months\n", + "\n", + "The function `compare_rentals()` combines both monthly rental count DataFrames.\n", + "\n", + "We use an `inner` merge because the challenge asks for customers who were active in both months.\n", + "\n", + "Then we calculate:\n", + "\n", + "```text\n", + "difference = rentals_06_2005 - rentals_05_2005\n", + "```\n", + "\n", + "This means:\n", + "\n", + "- Positive difference → the customer rented more in June.\n", + "- Negative difference → the customer rented more in May.\n", + "- Zero → the customer had the same number of rentals in both months." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c14d1813", + "metadata": {}, + "outputs": [], + "source": [ + "def compare_rentals(df1, df2):\n", + " \"\"\"\n", + " Combines two monthly rental count DataFrames and calculates the difference\n", + " between the number of rentals in the second month and the first month.\n", + " \n", + " Parameters:\n", + " df1: Rental count DataFrame for the first month.\n", + " df2: Rental count DataFrame for the second month.\n", + " \n", + " Returns:\n", + " A combined DataFrame with a difference column.\n", + " \"\"\"\n", + " \n", + " comparison = pd.merge(df1, df2, on=\"customer_id\", how=\"inner\")\n", + " \n", + " first_month_col = comparison.columns[1]\n", + " second_month_col = comparison.columns[2]\n", + " \n", + " comparison[\"difference\"] = comparison[second_month_col] - comparison[first_month_col]\n", + " \n", + " return comparison" + ] + }, + { + "cell_type": "markdown", + "id": "744192fb", + "metadata": {}, + "source": [ + "## 9. 
Compare May vs June Activity" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b46bde7", + "metadata": {}, + "outputs": [], + "source": [ + "comparison_df = compare_rentals(rentals_may_count, rentals_june_count)\n", + "\n", + "comparison_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "5b796c63", + "metadata": {}, + "source": [ + "## 10. Analyze the Results\n", + "\n", + "Now that we have the comparison DataFrame, we can extract useful insights." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c118436", + "metadata": {}, + "outputs": [], + "source": [ + "# Number of customers active in both months\n", + "active_both_months = comparison_df[\"customer_id\"].nunique()\n", + "\n", + "# Customers who increased, decreased, or maintained activity\n", + "increased_activity = comparison_df[comparison_df[\"difference\"] > 0].shape[0]\n", + "decreased_activity = comparison_df[comparison_df[\"difference\"] < 0].shape[0]\n", + "same_activity = comparison_df[comparison_df[\"difference\"] == 0].shape[0]\n", + "\n", + "activity_summary = pd.DataFrame({\n", + " \"activity_change\": [\"Increased in June\", \"Decreased in June\", \"Same activity\"],\n", + " \"number_of_customers\": [increased_activity, decreased_activity, same_activity]\n", + "})\n", + "\n", + "activity_summary" + ] + }, + { + "cell_type": "markdown", + "id": "36a1523b", + "metadata": {}, + "source": [ + "### Insight\n", + "\n", + "This table shows how customer behavior changed between May and June.\n", + "\n", + "It helps us identify whether customer engagement increased, decreased, or stayed stable among customers who were active in both months." + ] + }, + { + "cell_type": "markdown", + "id": "3113f318", + "metadata": {}, + "source": [ + "## 11. Top Customers with the Biggest Increase\n", + "\n", + "These are the customers whose rental activity increased the most from May to June." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11179e6e", + "metadata": {}, + "outputs": [], + "source": [ + "top_increase = comparison_df.sort_values(by=\"difference\", ascending=False).head(10)\n", + "top_increase" + ] + }, + { + "cell_type": "markdown", + "id": "6ef1bede", + "metadata": {}, + "source": [ + "## 12. Customers with the Biggest Decrease\n", + "\n", + "These are the customers whose rental activity decreased the most from May to June." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25a22125", + "metadata": {}, + "outputs": [], + "source": [ + "top_decrease = comparison_df.sort_values(by=\"difference\", ascending=True).head(10)\n", + "top_decrease" + ] + }, + { + "cell_type": "markdown", + "id": "16008747", + "metadata": {}, + "source": [ + "## 13. Average Rental Activity\n", + "\n", + "We can also compare the average number of rentals per customer in both months." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9f2531f", + "metadata": {}, + "outputs": [], + "source": [ + "may_col = \"rentals_05_2005\"\n", + "june_col = \"rentals_06_2005\"\n", + "\n", + "average_activity = pd.DataFrame({\n", + " \"month\": [\"May 2005\", \"June 2005\"],\n", + " \"average_rentals_per_customer\": [comparison_df[may_col].mean(), comparison_df[june_col].mean()]\n", + "})\n", + "\n", + "average_activity" + ] + }, + { + "cell_type": "markdown", + "id": "192b59a9", + "metadata": {}, + "source": [ + "### Insight\n", + "\n", + "This helps us understand whether the same group of customers became more or less active on average in June compared to May." + ] + }, + { + "cell_type": "markdown", + "id": "0ea4d7e8", + "metadata": {}, + "source": [ + "## 14. Optional Visualization\n", + "\n", + "A simple bar chart can help visualize how many customers increased, decreased, or maintained their rental activity." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6aa720bb", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.figure(figsize=(8, 5))\n", + "plt.bar(activity_summary[\"activity_change\"], activity_summary[\"number_of_customers\"])\n", + "plt.title(\"Customer Rental Activity Change: May vs June 2005\")\n", + "plt.xlabel(\"Activity Change\")\n", + "plt.ylabel(\"Number of Customers\")\n", + "plt.xticks(rotation=20)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "60e8bb9f", + "metadata": {}, + "source": [ + "## 15. Final Conclusions\n", + "\n", + "Based on the comparison between May and June 2005:\n", + "\n", + "- We identified customers who were active in both months using an inner merge.\n", + "- We calculated how many rentals each customer made in May and June.\n", + "- We created a `difference` column to measure how customer activity changed.\n", + "- Customers with a positive difference were more active in June.\n", + "- Customers with a negative difference were more active in May.\n", + "- Customers with a difference of zero made the same number of rentals in both months.\n", + "\n", + "### Business Interpretation\n", + "\n", + "This type of analysis is useful because it helps a company understand customer engagement over time.\n", + "\n", + "For a movie rental business like Sakila, this could help identify:\n", + "\n", + "- Customers becoming more engaged.\n", + "- Customers whose activity is declining.\n", + "- Opportunities for retention campaigns.\n", + "- Customers who may respond well to loyalty programs or personalized recommendations." + ] + }, + { + "cell_type": "markdown", + "id": "871c0369", + "metadata": {}, + "source": [ + "## 16. 
Key Takeaway\n", + "\n", + "Connecting Python to SQL allows us to combine the strengths of both tools:\n", + "\n", + "- SQL is useful for retrieving structured data from the database.\n", + "- Python and Pandas are useful for transforming, analyzing, and visualizing that data.\n", + "\n", + "This workflow is very common in data analytics projects." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.14.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}