Coding in Color

When You Wish Upon a Streamlit App: Disney Data continued

Wed, 04 Dec 2024 00:00:00 +0000

Data visualization gets even more memorable when it can be played with. Static images can show you a good overview, but if a picture's worth a thousand words, then a movie should be worth 100 million. Pull up the webapp side by side if you'd like to see the value. (And check out the GitHub repo for the full code I used.) Together we can answer the question: How has Disney been doing as a company over the last few years?

How to build a Streamlit app

You can connect your GitHub account to Streamlit and deploy an app directly from there: sign up for the free Community Cloud and sign in with GitHub. Once you have an account, you can deploy someone else’s app, or initialize a new repo for your own, create a main.py file (or whatever you’d like to call it) and a requirements.txt file, and get to writing. If you’re making one for yourself, you’ll need to import the streamlit library, plotly for interactive plots, and probably a few others. My import list in main.py looked like this for this app:

import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from io import BytesIO
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Every library the app needs must be listed on its own line in your requirements.txt file (like this). Once all that is done, you can deploy your app and watch your code changes take effect in real time.

Go to share.streamlit.io and click “Create app” in the top right corner. If you’re doing it through GitHub like me, you’ll select the “Deploy a public app from GitHub” option. Now select the repo that your code is in, select the main or master branch, select the name of your main.py file, and create a custom URL for your app. The URL I chose for my app is disney-stocks-and-movies.streamlit.app. Select deploy and you’ll be able to visit your app at the link you just created.

Before you do any data visualization you’ll need some data to visualize. Since I wanted to find out more about my Disney data, I read that in. In your main.py file under the import statements you can read in as many data sets as you want by using code that looks like this:

@st.cache_data  # cache the result so the data isn't re-downloaded on every rerun
def load_stock_data():  # function for data source
    url = 'https://github.com/KimmyBeeW/Disney-Web-Scraping/raw/main/datasets/all_disney_stocks.csv'
    stocks = pd.read_csv(url, index_col = 0)  # read it into a DataFrame
    return stocks  # return DataFrame

stocks = load_stock_data()  # call the function and assign to variable
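One detail worth checking: the slider filter later in this post compares stocks['Date'] against pd.Timestamp values, which only works if that column holds real datetimes rather than strings. Here's a minimal sketch of one way to make sure, using a tiny made-up CSV in place of the real file (the sample rows and the parse_stock_csv name are just for illustration, not part of my app):

```python
from io import StringIO

import pandas as pd

# Made-up sample rows standing in for the real CSV (illustration only)
SAMPLE_CSV = """,Date,Open,High,Low,Close
0,2021-03-08,10.0,12.0,9.0,11.0
1,2022-05-06,20.0,22.0,19.0,21.0
"""


def parse_stock_csv(source):
    """Read a stock CSV and make sure 'Date' is a datetime column."""
    stocks = pd.read_csv(source, index_col=0)
    stocks["Date"] = pd.to_datetime(stocks["Date"])  # strings -> datetimes
    return stocks


stocks_demo = parse_stock_csv(StringIO(SAMPLE_CSV))
print(stocks_demo.dtypes["Date"])
```

If your own loader already converts the dates somewhere else, you can skip this step.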

And don’t forget to give your new app a title!

st.title("Disney Stocks and Disney Brands Box Office Numbers")

Play with the Data

It’s now time to add your interactive elements and make some discoveries about your data! You can use tools like a sidebar to really optimize your data visualization.

with st.sidebar:  # interactive side bar
    brands = st.radio('Brand name', ['Marvel', 'Lucasfilm', 'Pixar', 'Walt Disney Animation', 'Disney Channel', 'Disneytoon Studios', 'Disneynature', 'Blue Sky Studios'])

Or maybe you’d like to use different tabs to visualize your data. I always love a good line graph for time data:

tab1, tab2, tab3, tab4 = st.tabs(["Disney Stocks", "2nd tab", "3rd tab", "4th tab"]) # switch out the names obviously
with tab1:
    # slider for range of the graph dates
    startyear_input = st.slider('Start Year', min_value = 1962, max_value=2024, value=2021)
    endyear_input = st.slider('End Year', min_value = 1962, max_value=2024, value=2024)
    filt_stocks = stocks[(stocks['Date'] >= pd.Timestamp(startyear_input, 1, 1)) & 
                         (stocks['Date'] <= pd.Timestamp(endyear_input, 12, 31))].copy()
    # make it possible to see all four lines
    mlt_stocks = filt_stocks.melt(id_vars='Date', value_vars=['Close', 'Open', 'High', 'Low'], 
                                     var_name='Type', value_name='Price ($)')
    # custom colors
    val_colors = {'High': '#29b6f6', 'Low': '#a80930',
                     'Close': '#efb71d', 'Open': '#2bb007'}
    # plot the stocks
    fig = px.line(mlt_stocks, x='Date', y='Price ($)', color='Type',
                  title='Disney Stock Prices Over Time',
                  labels={'Price ($)': 'Stock Price ($)', 'Type': 'Price Type'},  # keys must match column names
                  color_discrete_map=val_colors)
    st.plotly_chart(fig)

Now I can see a wide range of years and really put into perspective how Disney’s stocks are doing. Hovering over the peak allows me to see that Disney’s highest stock price in the last few years was $203 back in March of 2021 and that it hasn’t been that high since. But I could play with the slider and go check out how Disney was doing back in the 1960s if I wanted to.
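If you'd like to try the filter-and-melt step outside of Streamlit, here's a rough sketch of the same logic as a plain function. The filter_and_melt name, the toy prices, and the swap when the start year is after the end year are my own additions for illustration; the two sliders above are independent, so nothing stops them from crossing.

```python
import pandas as pd


def filter_and_melt(stocks, start_year, end_year):
    """Keep rows within [start_year, end_year] and reshape to long format."""
    if start_year > end_year:
        # The two sliders are independent, so swap if they cross
        start_year, end_year = end_year, start_year
    in_range = (stocks["Date"] >= pd.Timestamp(start_year, 1, 1)) & (
        stocks["Date"] <= pd.Timestamp(end_year, 12, 31)
    )
    return stocks[in_range].melt(
        id_vars="Date",
        value_vars=["Close", "Open", "High", "Low"],
        var_name="Type",
        value_name="Price ($)",
    )


# Toy rows with made-up prices, one per year
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-01", "2021-06-01", "2022-06-01"]),
    "Open": [1.0, 2.0, 3.0],
    "High": [1.5, 2.5, 3.5],
    "Low": [0.5, 1.5, 2.5],
    "Close": [1.2, 2.2, 3.2],
})
long_toy = filter_and_melt(toy, 2021, 2022)
print(long_toy)
```

Each original row turns into four rows (one per price type), which is the shape px.line needs to draw one colored line per type.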

Those sidebar buttons from earlier can be used to look at the data from different brands so you can see patterns like peaks in the late 2010s and how none of the newer movies have even gotten close to the same opening revenue. Of course part of that can be explained by the change in culture following the pandemic in 2020, but it also coincides with the release of Disney+ in November of 2019.

Interested in more ways to visualize your data? Me too! Streamlit has lots of resources for learning about its cool features like input widgets, tabs, and the sidebar. And of course if you want to continue learning from my code, you can always check out my repo and the code I used to implement the cool features in my Streamlit app.

At Last I See The Light

Building my little app helped me see my data in a clearer light, and it really did help me understand the patterns I noticed before at a deeper level. The stocks are definitely sinking, and the movies are aligning with that pattern, but hopefully Disney can learn from its past mistakes and bring back the magic. For now I hope this little tutorial brought some magic to your life, and that you’ll continue to enjoy coding in color!

The Downfall of Disney? Once upon a Web Scraper

Wed, 13 Nov 2024 00:00:00 +0000

Okay, downfall may be a little dramatic, but if Disney continues at the rate it's going, I may not be far off. I used to be obsessed with the MCU. I loved watching all the theory videos on YouTube and was so excited every time a new movie was announced, but when, epic though it was, Endgame managed to kill off or fatally change many of my favorite characters, I knew the MCU, and Disney as a whole, would never be the same.

Around the same time, in 2019 and 2020, Disney’s live-action remakes made some interesting choices: a not-at-all-live-action “live action” Lion King with “Can You Feel the Love Tonight” sung in the middle of the day, and a Mulan whose titular character’s fighting spirit was replaced with magical chi superpowers. In other words, they completely missed the point of some of the old stories. The animations are still beautiful and the stories are still enjoyable, but it got me wondering if those changes, and the apparent emphasis on quantity over quality, were starting to hurt Disney as a business. So I took my newfound web-scraping skills to the Box Office Mojo website and Yahoo! Finance to see what the moviegoers and stockholders had to say.

How has Disney been doing as a company over the last few years?

Disney does more than just movies, so let’s look at how the shareholders view Disney’s success. I got this stock data from Yahoo! Finance, and you can see that, other than the obvious dip caused by the pandemic in March of 2020, Disney was on the rise until around February 2021, then fell again in November of 2021, and hasn’t gone above $130 since the beginning of 2022. That incidentally aligns with the terror that was “Doctor Strange in the Multiverse of Madness” (May 2022). It could be a coincidence though.

Because of the rise of inflation and the notoriety of the company, Disney’s stock is still doing better than it was pre-2014, which is further evidence against the company falling anytime soon. Loyal fans are still clinging on to the nostalgia of the past and the merch of the present, and there is no doubt that Disney is still a successful company. It’s just an interesting pattern to note before diving into the success of current movies.

A tale as old as Box Office numbers

Now that we have an overview of what the shareholders think, let’s look at what actual movie-goers think. Box Office Mojo has a really awesome collection of brands, their respective movies, and info about those movies’ lifetime gross income to date, max number of theaters, opening weekend gross income, opening number of theaters, release date, and distributor. It’s fun to look through their website, but I really wanted an aggregate of brands owned by Disney, so I used web-scraping to look at the following brands from the Box Office Mojo website:

Marvel Comics – Disney acquired Marvel Entertainment in 2009.
Lucasfilm – Disney purchased Lucasfilm in 2012.
Pixar – Acquired by Disney in 2006.
Walt Disney Animation Studios
Blue Sky Studios – Disney acquired Blue Sky as part of the 2019 purchase of 21st Century Fox, but it was shut down in 2021.
Disney Channel
DisneyToon Studios – A division of Disney, closed in 2018.
Disneynature

Mini Web-scraping tutorial for Python

When web-scraping data online, you must ALWAYS check the robots.txt file. It tells you what you can and should not scrape, and how fast you can scrape it. Nearly every domain has one, and even if a site doesn’t, it is common courtesy to add a sleep timer between requests. (Here is some example code, since I didn’t need to use one in my web-scraping):

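import time
import requests
from bs4 import BeautifulSoup

websites = []

for link in links:  # links: a list of URLs you've already collected
    time.sleep(10)  # wait 10 seconds between requests
    r = requests.get(link)
    bs = BeautifulSoup(r.text, 'html.parser')
    try:
        website = bs.find('div', {'class': 'unique_tag_text_from_html'}).find('a')['href']
    except (AttributeError, TypeError, KeyError):  # tag, link, or href missing
        website = None
    websites.append(website)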

I checked robots.txt by using my RequestGuard class in Python to check that the URLs I was scraping from weren’t part of the forbidden list, but I recently learned about the urllib.robotparser module in Python’s standard library that can do the same thing for you. Check out this brief overview to learn more. Everything was good for my URLs, so I moved on to the actual data gathering.
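Here's a small sketch of what the standard-library approach can look like. The rules below are a hypothetical robots.txt that I made up and parse from memory so no request is sent; the example.com URLs are placeholders too.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from memory so no request is made
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)

allowed = parser.can_fetch("*", "https://example.com/brand/some-page/")
blocked = parser.can_fetch("*", "https://example.com/private/data/")
delay = parser.crawl_delay("*")  # None when the file sets no Crawl-delay

print(allowed, blocked, delay)
```

For a real site you would point the parser at the live file with parser.set_url(...) followed by parser.read() instead of calling parse() yourself, and the crawl delay (if there is one) tells you how long your sleep timer should be.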

I used bs4.BeautifulSoup, but if the website you’re trying to scrape has buttons that need clicking, you’ll want to use Selenium.

Now you simply inspect the page’s HTML by right-clicking on the page and choosing Inspect, hovering over the HTML parts until the data you want to scrape is highlighted, and finding the unique tags attached to those data points. For readability, I’d also recommend converting the data to a pandas DataFrame.

My code looks like this:

url = "https://www.boxofficemojo.com/brand/bn3732077058/"
rg = RG(url)
if rg.can_follow_link(url):
    print("robots.txt allows scraping for this page")

r = requests.get(url)
print(r.status_code)  # if it's 200 we're good to go

soup = BeautifulSoup(r.text, 'html.parser')  # parse the whole page into a soup object
container = soup.find('div', {'class': 'a-section imdb-scroll-table mojo-gutter'})  # a smaller chunk to make it easier to find names

Once you have a container, you get to play “find the pieces” with your data. When you know what the rows are contained in (usually something like a ‘tr’ table row tag), you can iterate through them and make a DataFrame:

items = container.find_all('tr')
ranks, titles, gross, max_theaters, opening_earnings, opening_num_thtrs, release_dates, studios = [], [], [], [], [], [], [], []
for row in items[1:]:
    rank = row.find('td', class_='mojo-field-type-rank').text
    ranks.append(rank)
    title = row.find('td', class_='mojo-field-type-release').text
    titles.append(title)
    life_gross = row.find  # etc etc (check out repo for full code)

# combine the lists to make a pandas DataFrame
df = pd.DataFrame({'Rank': ranks, 'Title': titles, 'Gross Income': gross, 'Max Theaters': max_theaters, 'Opening Earnings': opening_earnings, 'Opening Num Theaters': opening_num_thtrs, 'Release Dates': release_dates, 'Studio': studios}).drop_duplicates().reset_index(drop=True)
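One thing to watch for: row.find(...).text raises an AttributeError if a row is missing that cell. Below is a minimal sketch of a more forgiving version. The HTML is a made-up snippet shaped like the table (not real Box Office Mojo data), and cell_text is a hypothetical helper of my own; only the two class names come from the code above.

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped like the table described above (not real data)
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Release</th></tr>
  <tr><td class="mojo-field-type-rank">1</td>
      <td class="mojo-field-type-release">Example Movie A</td></tr>
  <tr><td class="mojo-field-type-rank">2</td></tr>
</table>
"""


def cell_text(row, css_class):
    """Return the stripped text of a cell, or None if the cell is missing."""
    cell = row.find("td", class_=css_class)
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr")[1:]  # skip the header row
records = [
    {
        "Rank": cell_text(row, "mojo-field-type-rank"),
        "Title": cell_text(row, "mojo-field-type-release"),
    }
    for row in rows
]
print(records)
```

A list of dictionaries like this can go straight into pd.DataFrame(records), which saves keeping eight separate lists in sync.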

Drum roll for the data pulled

After scraping the data from the different brands, I combined it into a csv with 204 movies, reranked them to align with the combination, and called it disney_owned_movies.csv. I realized that there were plenty of movies missing from the dataset because they aren’t attached to the brands listed on Box Office Mojo, such as the Chronicles of Narnia movies, but my main points of interest revolved around Marvel, Pixar, and Walt Disney Animation, so I’ll leave adding the other Disney movies for you to explore if you’d like.

Without further ado, here are the highlights, and you can decide what you think of Disney’s fate. Only time will truly tell.

Marvel is clearly Disney’s highest grossing brand at the moment, so it was a wise purchasing decision, but it also means Marvel plays a huge role in Disney’s success.

And the number of opening theaters doesn’t always spell success, but it is interesting to note how much influence the releasing studio has.

No surprise here, the movie attributes are pretty highly correlated (just look at the bottom left triangle) meaning that they’re all affected by the same thing or by each other.

And the one we’ve all been waiting for, Opening Earnings over Time: And the zoomed in version:

Notice the peaks in 2019 and 2022? Maybe I was right after all. The beginnings of a pattern are occurring and it doesn’t look too good for our heroes, especially when compared to the stock data we gathered earlier.

For lifelong Disney fans, this data might evoke nostalgia for the golden era of box-office hits, but hopefully there is more good to come. It makes one wonder how the company will innovate in response to these challenges. It will be interesting to see if Disney pivots its strategies or doubles down on its streaming services. Could this signal the end of Disney’s box-office dominance? Only time will tell.

Want to do it yourself?

Check out my repo with all of the code I used to webscrape the Box Office Mojo website, my RequestGuard file for parsing the robots.txt files, the data I gathered, and some helpful links I found. Web scraping is a lot easier than I remembered it being; it just takes some puzzling and time. Go analyze data about your own interests, and always remember to have fun coding in color!

Data's Paintbrush in Python

Fri, 20 Sep 2024 00:00:00 +0000

The Data Visualization aspect of Data Science can be scary with how many tools, IDEs, coding languages, and platforms that are available. Plus, if you're into Data Science, you may already be familiar with R and RStudio, but outside of stats, R is hardly used, and it's always good to flesh out your portfolio of skills. So take a deep breath. Let's go back to basics and start with one of the most popular coding languages today: PYTHON

Where to get started: set up

Since this is a tutorial on how to create simple plots in Python, we need a place to write and run the code. It doesn’t need to be anything fancy. You can use an IDE like PyCharm, a text editor like VS Code, or a browser-based tool like Google Colab. All three have free options: for PyCharm, scroll down to Community Edition and select the dropdown for your computer type; for VS Code, just select the download button; and Google Colab is available to anyone with a Google account.

Now you’ll want to install and import a couple of libraries. For the sake of this tutorial, we’re going to use matplotlib.pyplot and numpy. If it’s your first time using these libraries in an IDE, text editor, or CLI, you’ll need to install them. This line in the terminal is how I install them on my Mac (same for Linux): `python3 -m pip install matplotlib numpy` The line for Windows is pretty similar: `py -m pip install matplotlib numpy`

Or you can follow this tutorial by Python if you get stuck.

Once a library has been installed (which you’ll only need to do once), you import it and shorten the name you use to reference it. Import libraries at the top of your file:

import matplotlib.pyplot as plt
import numpy as np

Now we’re ready to get graphing!

Plot Production

Line Graphs

Line Graphs are a great way to visualize trends in data. They’re used for lines of best fit and for displaying the relationship between simple datasets or two parts of a more complex dataset. The x-axis is typically used for the time over which the data is measured. Examples of line graphs include: stocks over hours, weight over months, price over season, number of ticket sales per day, etc.

The first part of the code for the line graph is defining our line graph function and looks like this:

def line(x_points, y_points, color):
    plt.plot(x_points, y_points, color = color)
    plt.show()

.plot and .show come from the matplotlib library and are accessed using our shortcut “plt”.

The function by itself doesn’t do anything because we need to define the data. You can do that using two lists of the same length:

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 4, 4, 3, 6, 8, 9, 11, 13]
line(x, y, 'b')  # calls on the function we just made

Or you can define a matrix/array with numpy:

data = np.array([[1, 13], [2, 11], [3, 10], [4, 8], [5, 7], [6, 6], [7, 4], [8, 2], [9, -1]])
line(data[:, 0], data[:, 1], 'g')
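Since line graphs are often used for a line of best fit, here's a quick sketch of how you might compute one with numpy for the x and y lists from above. The best_fit function name is my own, and a straight line (degree 1) is just one possible choice of fit.

```python
import numpy as np


def best_fit(x_points, y_points):
    """Return (slope, intercept) of the least-squares line through the points."""
    slope, intercept = np.polyfit(x_points, y_points, 1)
    return slope, intercept


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 4, 4, 3, 6, 8, 9, 11, 13]
slope, intercept = best_fit(x, y)
fit_y = [slope * xi + intercept for xi in x]  # y-values along the fitted line
print(round(slope, 2), round(intercept, 2))
```

Passing x and fit_y to the line function from earlier draws the fitted line.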

What if you want to add more than one line? Well, matplotlib.pyplot offers many colors which can help you distinguish lines, and putting multiple plt.plot calls before a plt.show will let you put multiple lines on a single graph.

def colors():
    x_points = [0, 1, 2, 3, 4, 5]
    line1_y_points = [1, 2, 3, 4, 5, 6]
    line2_y_points = [2, 3, 4, 5, 6, 7]
    line3_y_points = [3, 4, 5, 6, 7, 8]
    line4_y_points = [4, 5, 6, 7, 8, 9]
    line5_y_points = [5, 6, 7, 8, 9, 10]
    line6_y_points = [6, 7, 8, 9, 10, 11]
    line7_y_points = [7, 8, 9, 10, 11, 12]
    line8_y_points = [8, 9, 10, 11, 12, 13]

    plt.plot(x_points, line1_y_points, 'r')  # red
    plt.plot(x_points, line2_y_points, 'g')  # green
    plt.plot(x_points, line3_y_points, 'b')  # blue
    plt.plot(x_points, line4_y_points, 'c')  # cyan
    plt.plot(x_points, line5_y_points, 'm')  # magenta
    plt.plot(x_points, line6_y_points, 'y')  # yellow
    plt.plot(x_points, line7_y_points, 'k')  # black
    plt.plot(x_points, line8_y_points, 'w')  # white (invisible on the default white background)

    plt.show()

colors()

Adding labels adds even more info! You could say it’s pretty cool.

def multi_lines_and_labels():
    x_points = [0, 1, 2, 3, 4, 5]
    y1_points = [6, 7, 8, 9, 10, 10]
    y2_points = [4, 5, 6, 5, 7, 10]

    plt.plot(x_points, y1_points, label="You", color = 'b')
    plt.plot(x_points, y2_points, label="Me", color = 'r')

    plt.title("Our Coolness Levels")
    plt.xlabel("Months learning Data Science")
    plt.ylabel("Coolness Level")

    plt.legend()
    plt.show()

multi_lines_and_labels()

Scatter Plots

Another way to visualize the relationship between two data categories is a scatter plot. Rather than looking at the general trend line, scatter plots allow us to see individual points and their density. They also make it easier to see points that have repeated x or y values.

def scatter(x_points, y_points):
    plt.scatter(x_points, y_points)
    plt.show()


x3 = [1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 7]
y3 = [1, 3, 2, 3, 2, 4, 5, 5, 5.5, 6.5, 7]
scatter(x3, y3)

Bar Graphs

Bar Graphs are a way to visualize the counts of your different factor levels or categorical data.

If you had a bag of fruit, and wanted to see how many of each type you have, bar graphs are a way to see that! In this example we have a lot more Apples than any other fruit.

def bar(categories, counts):
    plt.bar(categories, counts)
    plt.show()


fruit = ['Apple', 'Banana', 'Kiwi', 'Orange']
f_counts = [5, 1, 3, 1]
bar(fruit, f_counts)
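In real life you usually start with the raw items rather than the counts. Here's a small sketch of one way to tally them with Python's built-in Counter; the bag list is made up to match the counts above.

```python
from collections import Counter

# A made-up bag of fruit, one entry per piece
bag = ["Apple", "Kiwi", "Apple", "Banana", "Apple", "Kiwi",
       "Orange", "Apple", "Kiwi", "Apple"]

tally = Counter(bag)
fruit_names = sorted(tally)  # alphabetical categories
fruit_counts = [tally[name] for name in fruit_names]
print(fruit_names, fruit_counts)
```

fruit_names and fruit_counts can then be passed to the bar function just like fruit and f_counts.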

Histograms

def hist(frequencies):
    plt.hist(frequencies, [1, 2, 3, 4, 5, 6], color = 'g')
    plt.show()

freq = [
    1, 1, 1, 1, 1, 1,  # 6 ones
    2, 2, 2,  # 3 twos
    3,  # 1 three
    4, 4,  # 2 fours
    5  # 1 five
]
hist(freq)
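The list [1, 2, 3, 4, 5, 6] passed to plt.hist is a list of bin edges: each bar counts the values from its left edge up to (but not including) its right edge, except the last bar, which includes both edges. If you want to see the numbers behind the bars, here's a sketch using np.histogram, which follows the same binning rules:

```python
import numpy as np

freq = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 5]
edges = [1, 2, 3, 4, 5, 6]

# np.histogram uses the same binning rules as plt.hist
counts, bin_edges = np.histogram(freq, bins=edges)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left} to {right}: {count}")
```

Six edges make five bins, so the histogram above has five bars.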

Density Plot

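A density plot is used for the same thing as a histogram, but it shows a smoothed curve instead of binned counts, which can make the overall shape of the data easier to see. This one uses the seaborn library, so install it the same way as matplotlib if you don’t have it yet.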

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
sns.set_style('whitegrid')
sns.kdeplot(np.array(data), bw_method=0.5)  # bw_method controls how smooth the curve is
plt.show()

Pie Charts: The Bane of Statisticians’ Existence

Seriously, do not use pie charts for stats. Pie charts make it hard to actually tell the proportions between real data, so this is only for your information. Use at your own risk.

def pie():
    counts = [4, 1, 2, 3]
    plt.pie(counts)
    plt.show()
pie()

You can do it!

Granted, those were just the basics of data visualization in Python, but everyone’s got to start somewhere. In my experience, I learn how to code best when I copy someone else’s code and turn it into my own thing, so I invite you to do the same. Get PyCharm, VS Code, or Google Colab and try out these graphs for yourself. Switch up the numbers, colors, and data, and don’t be afraid to use Google or ChatGPT for ideas on how to expand the functionality of the graphs. If you want to go even further, explore these other data visualization libraries.

Thanks for being willing to explore data visualization in Python with me. Good luck coding in color!