Tri-3-Lesson
Reviewing Data Analysis.
- What you should Have to Start
- Lesson Portion 1: Reintroduction to Data Analysis, NumPy, and Pandas: Why Is It Important?
- Lesson Portion 2 More into NumPy
- Lesson Portion 3 More into Pandas
- What we are Covering
- What is Pandas and what is its purpose?
- Things you can do using pandas
- Pandas and Data analysis
- Dataframes
- Importing CSV Data
- In the code segment below, we use Pandas to read a CSV file containing NBA player statistics and store it in a DataFrame.
- Pandas is useful in this scenario because it provides various functions to filter, sort, and manipulate the NBA data efficiently. In this code, the DataFrame is filtered to only include the stats for the player you choose.
- Lesson Portion 4
- Lesson Portion 5; Summary
- Lesson Portion 6 Hacks
What you should Have to Start
- You should have already used wget to download this file (tri3-lesson.ipynb)
- wget this file: https://raw.githubusercontent.com/JoshuaW03628/Repository-1/master/nba_player_statistics.csv
- Copy the path of nba_player_statistics.csv and use it to replace the existing path in the code cells.
Data analysis is the process of examining data sets in order to find trends and draw conclusions about the given information. Data analysis is important because it helps businesses optimize their performance.
The Pandas library is widely used for data analysis in Python. The NumPy library is mostly used for working with numerical values, and it makes it easy to apply mathematical functions to them.
NumPy is a tool in Python that helps with doing math and data analysis. It's great for working with large amounts of data, like numbers in a spreadsheet. NumPy is really good at doing calculations quickly and accurately, like finding averages and doing linear algebra, and it supplies the arrays that plotting libraries such as Matplotlib turn into graphs. It's used a lot by scientists and people who work with data because it makes their work easier and faster.
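As a minimal sketch of what "doing calculations quickly" looks like in practice, the example below divides one array by another element by element, with no explicit loop. The shot totals and variable names are hypothetical and are not taken from the lesson's data.

```python
import numpy as np

# Hypothetical data: shots made and attempted by one player in four games
made = np.array([8, 5, 10, 7])
attempted = np.array([15, 12, 18, 14])

# Element-wise division: one shooting percentage per game, no loop needed
fg_pct = made / attempted

# Aggregate over the whole array to get the overall percentage
overall_pct = made.sum() / attempted.sum()

print("Per-game shooting percentage:", np.round(fg_pct, 3))
print(f"Overall shooting percentage: {overall_pct:.3f}")
```

The same two lines of arithmetic would work unchanged on arrays with thousands of games, which is the main reason NumPy is preferred over plain Python lists for numerical work.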
import numpy as np
This code uses NumPy's sum() function to calculate a baseball player's total plate appearances, counted here in a simplified way as hits plus walks plus strikeouts. It then adds up the number of times the player reached base (hits plus walks) and divides that by the total number of plate appearances to get the on-base percentage. The results are then printed to the console.
import numpy as np
# Example data
player_hits = np.array([3, 1, 2, 0, 1, 2, 1, 2]) # Player's hits in each game
player_walks = np.array([1, 0, 0, 1, 2, 1, 1, 0]) # Player's walks in each game
player_strikeouts = np.array([2, 1, 0, 2, 1, 1, 0, 1]) # Player's strikeouts in each game
# Total plate appearances (PA) for the player: hits + walks + strikeouts
total_pa = np.sum(player_hits) + np.sum(player_walks) + np.sum(player_strikeouts)
# Times on base (hits plus walks), used to compute on-base percentage (OBP)
total_bases = np.sum(player_hits) + np.sum(player_walks)
obp = total_bases / total_pa
# Print the total plate appearances and on-base percentage for the player
print(f"Total plate appearances: {total_pa}")
print(f"On-base percentage: {obp:.3f}")
import numpy as np
# Create a NumPy array of the heights of players in a basketball team
heights = np.array([192, 195, 193, 200, 211, 199, 201, 198, 184, 190, 196, 203, 208, 182, 207])
# Calculate the 25th, 50th, and 75th percentiles of the heights
percentiles = np.percentile(heights, [25, 50, 75])
# Print the results
print("The 25th percentile height is", percentiles[0], "cm.")
print("The 50th percentile height is", percentiles[1], "cm.")
print("The 75th percentile height is", percentiles[2], "cm.")
# Determine the number of players who are in the top 10% tallest
top_10_percent = np.percentile(heights, 90)
tallest_players = heights[heights >= top_10_percent]
print("There are", len(tallest_players), "players in the top 10% tallest.")
import numpy as np
# Create a NumPy array x (fill in your own values)
x = np.array([])
# Calculate percentiles of x (replace 1, 2, 3 with the percentiles you want)
y = np.percentile(x, [1,2,3])
# Print the results
print("", y[0], "")
print("", y[1], "")
print("", y[2], "")
# Determine how many values are in the top 10% of x
t = np.percentile(x, 90)
z = x[x >= t]
print("There are", len(z), "players in the top 10% (x).")
Lesson Portion 3 More into Pandas
What we are Covering
- Explanation of Pandas and its uses in data analysis
- Importing Pandas library
- Loading data into Pandas DataFrames from CSV files
- Manipulating and exploring data in Pandas DataFrames
- Example of using Pandas for data analysis tasks such as filtering and sorting
Things you can do using pandas
- Data Cleansing; Identifying and correcting errors, inconsistencies, and inaccuracies in datasets.
- Data fill; Filling in missing values in datasets.
- Statistical Analysis; Analyzing datasets using statistical techniques to draw conclusions and make predictions.
- Data Visualization; Representing datasets visually using graphs, charts, and other visual aids.
- Data inspection; Examining datasets to identify potential issues or patterns, such as missing data, outliers, or trends.
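Several of the tasks listed above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the player names and numbers are hypothetical, and filling gaps with the column mean is just one possible fill strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each numeric column
stats = pd.DataFrame({
    "NAME": ["Player A", "Player B", "Player C", "Player D"],
    "PPG": [21.5, np.nan, 12.0, 30.1],
    "RPG": [5.0, 7.2, np.nan, 8.8],
})

# Data inspection: count the missing values in each column
missing_counts = stats.isna().sum()
print(missing_counts)

# Data fill: replace each missing value with the mean of its column
numeric_cols = ["PPG", "RPG"]
filled = stats.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Statistical analysis: summary statistics for the numeric columns
print(filled[numeric_cols].describe())
```

Working on a copy (`filled`) keeps the original data intact, so the raw and cleaned versions can still be compared afterwards.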
Pandas and Data analysis
The two most important data structures in Pandas are:
- Series; A Series is a one-dimensional labeled array that can hold data of any type (integer, float, string, etc.). It is similar to a column in a spreadsheet or a SQL table. Each element in a Series has a label, known as an index. A Series can be created from a list, a NumPy array, a dictionary, or another Pandas Series.
- DataFrame; A DataFrame is a two-dimensional labeled data structure that can hold data of different types (integer, float, string, etc.). It is similar to a spreadsheet or a SQL table. Each column in a DataFrame is a Series, and each row is identified by a label, known as an index. A DataFrame can be created from a dictionary of Series or NumPy arrays, a list of dictionaries, or another Pandas DataFrame.
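The two structures described above can be built by hand to see how they relate. This is a small sketch with hypothetical player names and values; the column names mirror the ones used in this lesson.

```python
import pandas as pd

# A Series built from a list, with custom labels as its index
ppg = pd.Series([21.5, 18.3, 12.0], index=["Player A", "Player B", "Player C"])
print(ppg["Player B"])  # look up a value by its label

# A DataFrame built from a dictionary: each key becomes a column
df = pd.DataFrame({
    "NAME": ["Player A", "Player B", "Player C"],
    "PPG": [21.5, 18.3, 12.0],
    "RPG": [5.0, 7.2, 3.1],
})

# Selecting a single column of a DataFrame gives back a Series
ppg_column = df["PPG"]
print(type(ppg_column).__name__)
print(df.shape)  # (number of rows, number of columns)
```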
import pandas as pd
pd.__version__
import pandas as pd
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('/Users/josh/Repository-1/nba_player_statistics.csv')
# Filter the DataFrame to only include stats for a specific player (in this case, Jimmy Butler)
player_name = 'Jimmy Butler'
player_stats = df[df['NAME'] == player_name]
# Display the stats for the player
print(f"\nStats for {player_name}:")
print(player_stats[['PPG', 'RPG', 'APG']])
In the code segment below, we use Pandas to read a CSV file containing NBA player statistics and store it in a DataFrame.
Pandas is useful in this scenario because it provides various functions to filter, sort, and manipulate the NBA data efficiently. In this code, the DataFrame is filtered to only include the stats for the player you choose.
import pandas as pd
# Load CSV file into a Pandas DataFrame
df = pd.read_csv('/Users/josh/Repository-1/nba_player_statistics.csv')
# Get player name input from user
player_name = input("Enter player name: ")
# Filter the DataFrame to only include stats for the specified player
player_stats = df[df['NAME'] == player_name]
# Check if the player exists in the DataFrame
if player_stats.empty:
    print("No stats found for that player.")
else:
    # Display the stats for the player entered
    print(f"\nStats for {player_name}:")
    print(player_stats[['PPG', 'RPG', 'APG', 'P+R+A']])
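The lesson mentions sorting alongside filtering, so here is a short sketch of the two combined. It uses a small hypothetical DataFrame in place of the NBA CSV so that it runs on its own; the 20 PPG cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the NBA DataFrame
df = pd.DataFrame({
    "NAME": ["Player A", "Player B", "Player C", "Player D"],
    "PPG": [21.5, 30.1, 12.0, 25.4],
    "RPG": [5.0, 8.8, 3.1, 10.2],
})

# Sort from the highest scorer to the lowest
by_scoring = df.sort_values("PPG", ascending=False)

# Filter to players averaging at least 20 PPG, then keep the top two
top_scorers = by_scoring[by_scoring["PPG"] >= 20].head(2)
print(top_scorers[["NAME", "PPG"]])
```

Note that `sort_values` returns a new DataFrame and leaves `df` unchanged, so the original row order is still available.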
import numpy as np
import pandas as pd
# Load CSV file into a Pandas DataFrame (adjust the path to where you saved the file)
df = pd.read_csv('nba_player_statistics.csv')
# Filter the DataFrame to only include stats for the specified player
player_name = input("Enter player name: ")
player_stats = df[df['NAME'] == player_name]
if player_stats.empty:
    print("No stats found for that player.")
else:
    # Convert the player stats to a NumPy array
    player_stats_np = np.array(player_stats[['PPG', 'RPG', 'APG', 'P+R+A']])
    # Calculate the average of each statistic for the player
    player_stats_avg = np.mean(player_stats_np, axis=0)
    # Print out the average statistics for the player
    print(f"\nAverage stats for {player_name}:")
    print(f"PPG: {player_stats_avg[0]:.2f}")
    print(f"RPG: {player_stats_avg[1]:.2f}")
    print(f"APG: {player_stats_avg[2]:.2f}")
    print(f"P+R+A: {player_stats_avg[3]:.2f}")
In the code above, Pandas is used to load a CSV file containing NBA player stats into a DataFrame object.
Once the data is loaded into the DataFrame, Pandas is used to filter the data to only include the stats for a specific player, based on the input entered by the user. The filtered data is then stored in a new DataFrame object called player_stats.
NumPy matters in this code because it performs operations on arrays efficiently. Specifically, the selected columns of the Pandas DataFrame are converted to a NumPy array, and np.mean with axis=0 then calculates the average of each statistic for the player you entered. Without NumPy, it would be more difficult and less efficient to perform these calculations on large data sets. It does the math for us.
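To make the conversion step concrete, the sketch below averages a small hypothetical DataFrame two ways: through a NumPy array with axis=0, and directly with Pandas. The three rows are made-up values, assumed here to be several entries for the same player.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for one player (for example, several seasons)
player_stats = pd.DataFrame({
    "PPG": [20.0, 22.0, 24.0],
    "RPG": [5.0, 6.0, 7.0],
    "APG": [4.0, 5.0, 6.0],
})

# Convert to a NumPy array and average down each column (axis=0)
stats_array = player_stats.to_numpy()
numpy_avg = np.mean(stats_array, axis=0)

# Pandas can compute the same column means directly
pandas_avg = player_stats.mean()

print(numpy_avg)
print(pandas_avg)
```

Both approaches give the same numbers; axis=0 means "collapse the rows", which leaves one average per column.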
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the CSV file into a Pandas DataFrame (adjust the path to where you saved the file)
df = pd.read_csv('nba_player_statistics.csv')
# Print the first 5 rows of the DataFrame
print(df.head())
# Calculate the mean, median, and standard deviation of the 'MPG' column
mean_minutes = df['MPG'].mean()
median_minutes = df['MPG'].median()
stddev_minutes = df['MPG'].std()
# Print the results
print('Mean Minutes: ', mean_minutes)
print('Median Minutes: ', median_minutes)
print('Standard Deviation Minutes: ', stddev_minutes)
# Create a histogram of the 'MPG' column using Matplotlib
plt.hist(df['MPG'], bins=20)
plt.title('MPG Histogram')
plt.xlabel('MPG')
plt.ylabel('Frequency')
plt.show()
This is a computer program that helps us look at information about basketball players. The program uses some special tools called NumPy, Pandas, and Matplotlib to help us understand the information better. The program first reads a file that has all the information about basketball players, then it shows us the first 5 rows of that information.
Next, the program helps us find the average (mean), middle (median), and how spread out (standard deviation) the amount of time players spent on the court (MPG) is. The program then shows us those numbers.
Finally, the program shows us a picture (graph) that helps us see how often players spent different amounts of time on the court. The picture has a title that tells us what it is about, labels on the bottom and side that tell us what the picture is showing, and some bars that help us see how often players spent different amounts of time on the court.
The graph shows minutes per game (MPG) along the bottom and, for each range of minutes, how many players averaged that many minutes.
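To see where the bar heights in a histogram come from, the counting step can be done by hand with NumPy's histogram function, which performs the same kind of binning without drawing anything. The MPG values, the three bins, and the 10 to 40 range below are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical minutes-per-game values for ten players
mpg = np.array([12.5, 18.0, 22.3, 24.1, 27.8, 30.2, 31.5, 33.0, 34.6, 36.1])

# Split the range 10 to 40 into three equal-width bins and count players
counts, bin_edges = np.histogram(mpg, bins=3, range=(10, 40))

# Each count is the height of one bar in the corresponding histogram
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:.0f} to {right:.0f} MPG: {count} players")
```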
Summary/Goals of Lesson:
One of our goals was to help you understand data analysis and how it can be important in optimizing business performance. We also wanted to make sure you understood the use of the Pandas and NumPy libraries in data analysis, with a focus on NumPy. As people who work with data, we find Pandas incredibly useful for manipulating, analyzing, and visualizing data in Python. We use Pandas to calculate individual player and team statistics. We are a group that works with numerical data, so NumPy is one of our favorite tools for working with arrays and applying mathematical functions to them. It is very fast at computing and manipulating arrays, which makes it a valuable tool for tracking statistics, something that is important to our group. For example, if you have an array of the points scored by each player in a game, you can use NumPy to calculate the total points scored, the average points per player, or the highest and lowest scoring players.
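The points example from the summary might look like the sketch below. The player names and point totals are hypothetical.

```python
import numpy as np

# Hypothetical player names and points from a single game
players = np.array(["Player A", "Player B", "Player C", "Player D"])
points = np.array([28, 14, 9, 21])

total_points = np.sum(points)
average_points = np.mean(points)

# argmax/argmin return the position of the largest/smallest value,
# which can then be used to look up the matching player name
top_scorer = players[np.argmax(points)]
low_scorer = players[np.argmin(points)]

print(f"Total: {total_points}, average: {average_points:.1f}")
print(f"Highest scorer: {top_scorer}, lowest scorer: {low_scorer}")
```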
Lesson Portion 6 Hacks
Printing a CSV File (0.5)
- Use this link https://github.com/ali-ce/datasets to select a topic you are interested in, or you may find one online.
- Once you select your topic, make sure it is a CSV file, then press the button that says Raw.
- After that, copy the raw contents, create a file whose name ends in .csv, and paste the contents into it.
- Below is a starting point that you can use for your hacks.
- Your goal is to print 2 specific parts of the data (for example, population and country).
Popcorn Hacks (0.2)
- Lesson Portion 1.
Answering Questions (0.2)
- Found Below.
Submit By Thursday 8:35 A.M.
- How to Submit: Slack a Blog Post that includes all of your hacks to "Joshua Williams" on Slack.
import pandas as pd
# read the CSV file
df = pd.read_csv("blank.csv")
# display the data in a table
print(df)
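For the goal of printing two specific parts of the data, one possible approach is to select columns by name. The sketch below reads hypothetical CSV text from memory so that it runs without a file; the column names country, population, and region, and every value in the data, are made up, and with your own file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for your own .csv file
csv_text = """country,population,region
Exampleland,1000,A
Samplestan,2500,B
Testonia,400,A
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# Select two specific columns by passing a list of column names
selected = df[["country", "population"]]
print(selected)
```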
Question Hacks:
What is NumPy and how is it used in data analysis?
What is Pandas and how is it used in data analysis?
How is NumPy different from Pandas for data analysis?
What is a DataFrame?
What are some common operations you can perform with NumPy?
How can you incorporate either of these data analysis tools (NumPy, Pandas) into your project?