
Mastering Web Scraping with Python: JSON Data Handling

Chapter 1: Introduction to JSON Web Scraping

In this article, we will delve into the organization of JSON response objects obtained through web scraping with Python's requests library, building on my previous articles.

Before diving in, I recommend checking out my earlier pieces, especially the one that explores various scraping techniques. Today, we'll employ a straightforward method by executing a GET request to an API endpoint. Our focus will be on scraping live football matches from PaddyPower, similar to the previous tutorial.

First, we'll initiate a request to the endpoint, as demonstrated below, and parse the response into a JSON format.

import requests
import pandas as pd
from datetime import datetime

# Replace 'your_endpoint_url' with the actual URL you want to call
endpoint_url = 'your_endpoint_url'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}

# Make a GET request and parse the response body as JSON
response = requests.get(endpoint_url, headers=headers)
json_data = response.json()

Next, we will create empty lists to store the data we extract:

event_ids = []
match_ids = []
fixtures = []
odds_matches = []
competitions = []
market_types = []
market_names = []
match_dates_and_time = []
time_scraped = []

Now, we move on to the more intricate part of the tutorial: navigating through the JSON data. If you're not well-versed in JSON structure, it is akin to how JavaScript organizes objects. JSON consists of key-value pairs, which, while seemingly straightforward, can become complex due to nested data.
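
As a quick, generic illustration (unrelated to the PaddyPower response itself), nested key-value pairs parse into Python dictionaries and are walked with chained lookups:

profile = {"user": {"name": "Alice", "scores": [70, 82]}}

print(profile["user"]["name"])        # chained lookups walk the nesting -> 'Alice'
print(profile["user"]["scores"][1])   # nested lists are indexed by position -> 82
print(profile["user"].get("age", 0))  # .get() supplies a default if a key is missing -> 0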

Inspecting the response, we can see that the "attachments" object contains the crucial information: its "competitions" entry holds competition names and IDs, while its "events" entry holds the match IDs and fixture names.
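
To make that concrete, the responses we work with below follow roughly the shape sketched here. This is a simplified, assumed layout with made-up values; only the "attachments", "events", "competitions", and "markets" keys (and the fields the code reads) are taken from the actual scrape.

# Simplified, assumed sketch of the response structure (illustrative values only)
example_response = {
    "attachments": {
        "competitions": {
            "12345": {"competitionId": "12345", "name": "Premier League"}
        },
        "events": {
            "67890": {"name": "Home Team v Away Team", "openDate": "2024-01-01T15:00:00.000Z"}
        },
        "markets": {
            "1.234567890": {
                "marketName": "Match Odds",
                "runners": [{"runnerName": "Home Team"}, {"runnerName": "Draw"}, {"runnerName": "Away Team"}],
            }
        },
    }
}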

Let's store the events object in a variable called data (a more descriptive name would be better in real code). Iterating over this dictionary yields its keys, which are the event (match) IDs, so we append each one to our list.

data = json_data['attachments']['events']

# Iterating over the dictionary yields its keys, i.e. the event IDs
for i in data:
    event_ids.append(i)

We will then construct the request URLs to retrieve market and odds data for each match:

end_of_url = '&exchangeLocale=en_GB&includeBettingOpportunities=true&includePrices=true&includeSeoCards=true&includeSeoFooter=true&language=en&loggedIn=false&priceHistory=1&regionCode=UK'

# start_of_url is the base of the per-match prices endpoint, up to the event ID
# parameter; define it alongside endpoint_url for the API you are calling
for i in event_ids:
    url = start_of_url + str(i) + end_of_url
    print(url)

    match_response = requests.get(url, headers=headers)
    match_data = match_response.json()

Next, we will extract specific data such as the match name (fixture) and the competition name:

    # Still inside the loop over event_ids: pull out the fixture name and kick-off time
    match_name = match_data['attachments']['events']
    keys_list = list(match_name)
    event_data = match_name.get(keys_list[0], {})
    event_name = event_data.get('name', 'N/A')
    event_time = event_data.get('openDate', 'N/A')

    # ... and the competition name and ID
    competition_name = match_data['attachments']['competitions']
    keys_list = list(competition_name)
    competition_data = competition_name.get(keys_list[0], {})
    competition_name_new = competition_data.get('name', 'N/A')
    competition_id = competition_data.get('competitionId', 'N/A')

We then gather the market keys from the match data and loop over each market and its selections (runners):

    # Still inside the loop: iterate over every market and each of its selections
    markets = match_data['attachments']['markets'].keys()
    for j in markets:
        markets_new = match_data['attachments']['markets'][j]['marketName']
        selections = match_data['attachments']['markets'][j]['runners']
        for k in selections:
            fixtures.append(event_name)
            competitions.append(competition_name_new)
            market_types.append(markets_new)
            match_dates_and_time.append(event_time)
            match_ids.append(i)

            name = k['runnerName']
            market_names.append(name)

            current_time = datetime.now()
            time_scraped.append(current_time)
            print(name)

            # Some selections have no priced odds, so fall back to a placeholder
            try:
                odds = k['winRunnerOdds']['trueOdds']['decimalOdds']['decimalOdds']
            except (KeyError, TypeError):
                odds = 'Issue with Odds'
            odds_matches.append(odds)

Finally, we will create a DataFrame to store all the collected data:

columns = ['Match ID', 'Fixture', 'Match Time and Date', 'Competition', 'Market Type', 'Market Name', 'Market Odds', 'Time Scraped']

# Initialize a new DataFrame with these columns
new_dataframe = pd.DataFrame(columns=columns)

# Add the lists to their columns
new_dataframe['Match ID'] = match_ids
new_dataframe['Fixture'] = fixtures
new_dataframe['Match Time and Date'] = match_dates_and_time
new_dataframe['Competition'] = competitions
new_dataframe['Market Type'] = market_types
new_dataframe['Market Name'] = market_names
new_dataframe['Market Odds'] = odds_matches
new_dataframe['Time Scraped'] = time_scraped

new_dataframe

The output is a DataFrame with one row per selection: the match ID, fixture, match date and time, competition, market type, selection name, decimal odds, and the time it was scraped.
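
If you want to keep the results between runs, the DataFrame can be written straight to disk. A minimal sketch, assuming a CSV file is sufficient (the filename is arbitrary):

# Write the scraped odds to a timestamped CSV file (filename is arbitrary)
filename = f"scraped_odds_{datetime.now():%Y%m%d_%H%M%S}.csv"
new_dataframe.to_csv(filename, index=False)
print(f"Saved {len(new_dataframe)} rows to {filename}")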

As always, feel free to reach out with any questions or feedback on this article. If you found it helpful, please give it a clap and follow for more!

Chapter 2: Video Resources

Explore how to scrape JSON data embedded in SCRIPT tags in this tutorial.

Learn to scrape live scores without the need for BeautifulSoup or Selenium.
