
Mastering Web Scraping with Python: JSON Data Handling

Chapter 1: Introduction to JSON Web Scraping

In this article, we will delve into the organization of JSON response objects obtained through web scraping with Python's requests library, building on my previous articles.

Before diving in, I recommend checking out my earlier pieces, especially the one that explores various scraping techniques. Today, we'll employ a straightforward method by executing a GET request to an API endpoint. Our focus will be on scraping live football matches from PaddyPower, similar to the previous tutorial.

First, we'll initiate a request to the endpoint, as demonstrated below, and parse the response into a JSON format.

import requests
import pandas as pd
from datetime import datetime

# Replace 'your_endpoint_url' with the actual URL you want to call
endpoint_url = 'your_endpoint_url'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}

# Make a GET request and parse the response body as JSON
response = requests.get(endpoint_url, headers=headers)
json_data = response.json()

Next, we will create empty lists to store the data we extract:

event_ids = []
match_ids = []
fixtures = []
odds_matches = []
competitions = []
market_types = []
market_names = []
match_dates_and_time = []
time_scraped = []

Now, we move on to the more intricate part of the tutorial: navigating through the JSON data. If you're not well-versed in JSON structure, it is akin to how JavaScript organizes objects. JSON consists of key-value pairs, which, while seemingly straightforward, can become complex due to nested data.
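
As a quick, generic illustration (unrelated to the PaddyPower response itself), nested key-value pairs parse into Python dictionaries and are walked with chained lookups:

profile = {"user": {"name": "Alice", "scores": [70, 82]}}

print(profile["user"]["name"])        # chained lookups walk the nesting -> 'Alice'
print(profile["user"]["scores"][1])   # nested lists are indexed by position -> 82
print(profile["user"].get("age", 0))  # .get() supplies a default if a key is missing -> 0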

Inspecting the response, we can see that the "attachments" object contains the crucial information: its "competitions" entry holds competition names and IDs, while its "events" entry holds the match IDs and fixture names.
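
To make that concrete, the responses we work with below follow roughly the shape sketched here. This is a simplified, assumed layout with made-up values; only the "attachments", "events", "competitions", and "markets" keys (and the fields the code reads) are taken from the actual scrape.

# Simplified, assumed sketch of the response structure (illustrative values only)
example_response = {
    "attachments": {
        "competitions": {
            "12345": {"competitionId": "12345", "name": "Premier League"}
        },
        "events": {
            "67890": {"name": "Home Team v Away Team", "openDate": "2024-01-01T15:00:00.000Z"}
        },
        "markets": {
            "1.234567890": {
                "marketName": "Match Odds",
                "runners": [{"runnerName": "Home Team"}, {"runnerName": "Draw"}, {"runnerName": "Away Team"}],
            }
        },
    }
}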

Let's store the events object in a variable called data (a more descriptive name would be better in real code). Iterating over this dictionary yields its keys, which are the event (match) IDs, so we append each one to our list.

data = json_data['attachments']['events']

# Iterating over the dictionary yields its keys, i.e. the event IDs
for i in data:
    event_ids.append(i)

We will then construct the request URLs to retrieve market and odds data for each match:

end_of_url = '&exchangeLocale=en_GB&includeBettingOpportunities=true&includePrices=true&includeSeoCards=true&includeSeoFooter=true&language=en&loggedIn=false&priceHistory=1&regionCode=UK'

# start_of_url is the base of the per-match prices endpoint, up to the event ID
# parameter; define it alongside endpoint_url for the API you are calling
for i in event_ids:
    url = start_of_url + str(i) + end_of_url
    print(url)

    match_response = requests.get(url, headers=headers)
    match_data = match_response.json()

Next, we will extract specific data such as the match name (fixture) and the competition name:

    # Still inside the loop over event_ids: pull out the fixture name and kick-off time
    match_name = match_data['attachments']['events']
    keys_list = list(match_name)
    event_data = match_name.get(keys_list[0], {})
    event_name = event_data.get('name', 'N/A')
    event_time = event_data.get('openDate', 'N/A')

    # ... and the competition name and ID
    competition_name = match_data['attachments']['competitions']
    keys_list = list(competition_name)
    competition_data = competition_name.get(keys_list[0], {})
    competition_name_new = competition_data.get('name', 'N/A')
    competition_id = competition_data.get('competitionId', 'N/A')

We then gather the market keys from the match data and loop over each market and its selections (runners):

    # Still inside the loop: iterate over every market and each of its selections
    markets = match_data['attachments']['markets'].keys()
    for j in markets:
        markets_new = match_data['attachments']['markets'][j]['marketName']
        selections = match_data['attachments']['markets'][j]['runners']
        for k in selections:
            fixtures.append(event_name)
            competitions.append(competition_name_new)
            market_types.append(markets_new)
            match_dates_and_time.append(event_time)
            match_ids.append(i)

            name = k['runnerName']
            market_names.append(name)

            current_time = datetime.now()
            time_scraped.append(current_time)
            print(name)

            # Some selections have no priced odds, so fall back to a placeholder
            try:
                odds = k['winRunnerOdds']['trueOdds']['decimalOdds']['decimalOdds']
            except (KeyError, TypeError):
                odds = 'Issue with Odds'
            odds_matches.append(odds)

Finally, we will create a DataFrame to store all the collected data:

columns = ['Match ID', 'Fixture', 'Match Time and Date', 'Competition', 'Market Type', 'Market Name', 'Market Odds', 'Time Scraped']

# Initialize a new DataFrame with these columns
new_dataframe = pd.DataFrame(columns=columns)

# Add the lists to their columns
new_dataframe['Match ID'] = match_ids
new_dataframe['Fixture'] = fixtures
new_dataframe['Match Time and Date'] = match_dates_and_time
new_dataframe['Competition'] = competitions
new_dataframe['Market Type'] = market_types
new_dataframe['Market Name'] = market_names
new_dataframe['Market Odds'] = odds_matches
new_dataframe['Time Scraped'] = time_scraped

new_dataframe

The output is a DataFrame with one row per selection: the match ID, fixture, match date and time, competition, market type, selection name, decimal odds, and the time it was scraped.
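
If you want to keep the results between runs, the DataFrame can be written straight to disk. A minimal sketch, assuming a CSV file is sufficient (the filename is arbitrary):

# Write the scraped odds to a timestamped CSV file (filename is arbitrary)
filename = f"scraped_odds_{datetime.now():%Y%m%d_%H%M%S}.csv"
new_dataframe.to_csv(filename, index=False)
print(f"Saved {len(new_dataframe)} rows to {filename}")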

As always, feel free to reach out with any questions or feedback on this article. If you found it helpful, please give it a clap and follow for more!

Chapter 2: Video Resources

Explore how to scrape JSON data embedded in SCRIPT tags in this tutorial.

Learn to scrape live scores without the need for BeautifulSoup or Selenium.
