Mastering Web Scraping with Python: JSON Data Handling
Chapter 1: Introduction to JSON Web Scraping
In this article, we will delve into the organization of JSON response objects obtained through web scraping with Python's requests library, building on my previous articles.
Before diving in, I recommend checking out my earlier pieces, especially the one that explores various scraping techniques. Today, we'll employ a straightforward method by executing a GET request to an API endpoint. Our focus will be on scraping live football matches from PaddyPower, similar to the previous tutorial.
First, we'll initiate a request to the endpoint, as demonstrated below, and parse the response into a JSON format.
import requests
from datetime import datetime

import pandas as pd

# Replace 'your_endpoint_url' with the actual URL you want to call
endpoint_url = 'your_endpoint_url'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}

# Make a GET request and parse the response body as JSON
response = requests.get(endpoint_url, headers=headers)
json_data = response.json()
Next, we will create lists, one per output column, to store our data:
event_ids = []
match_ids = []
fixtures = []
odds_matches = []
competitions = []
market_types = []
market_names = []
match_dates_and_time = []
time_scraped = []
Now, we move on to the more intricate part of the tutorial: navigating through the JSON data. If you're not well-versed in JSON structure, it is akin to how JavaScript organizes objects. JSON consists of key-value pairs, which, while seemingly straightforward, can become complex due to nested data.
From our analysis, we can see that the "attachments" object contains crucial information such as competition names and IDs. Moving further, the "events" object holds match IDs and fixture names.
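To make that structure concrete, here is a hypothetical sketch of the relevant part of the response (the field names match those used later in this article, but the IDs and values are invented for illustration, and the real payload contains many more fields):

```python
# Hypothetical sketch of the response structure; only the fields this
# tutorial actually reads are shown.
json_data = {
    "attachments": {
        "competitions": {
            "10932509": {"name": "English Premier League", "competitionId": "10932509"},
        },
        "events": {
            "33158994": {"name": "Arsenal v Chelsea", "openDate": "2024-01-01T15:00:00.000Z"},
        },
    },
}

# The keys of the "events" object are the match IDs
print(list(json_data["attachments"]["events"]))
```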
Let's store the events object in a variable called data (a more descriptive name, such as events, would be better). Iterating over this dictionary yields its keys, which are the match IDs, so we can append each one to our list.
data = json_data['attachments']['events']
for i in data:
    event_ids.append(i)
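This works because iterating over a Python dictionary yields its keys, not its values; with toy data (invented for illustration) the loop collects the IDs directly:

```python
# Toy stand-in for json_data['attachments']['events']
events = {"1001": {"name": "A v B"}, "1002": {"name": "C v D"}}

event_ids = []
for i in events:          # iterating a dict yields its keys
    event_ids.append(i)

print(event_ids)  # ['1001', '1002']
```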
We will then construct the request URL for each match to retrieve its market and odds data:
# start_of_url is assumed to hold the base event URL identified earlier
end_of_url = '&exchangeLocale=en_GB&includeBettingOpportunities=true&includePrices=true&includeSeoCards=true&includeSeoFooter=true&language=en&loggedIn=false&priceHistory=1&regionCode=UK'

for i in event_ids:
    url = start_of_url + i + end_of_url
    print(url)
    match_response = requests.get(url, headers=headers)
    match_data = match_response.json()
Next, we will extract specific data such as the match name (fixture) and the competition name:
    # Still inside the loop over event_ids: extract the fixture name and kick-off time
    match_name = match_data['attachments']['events']
    keys_list = list(match_name)
    event_data = match_name.get(keys_list[0], {})
    event_name = event_data.get('name', 'N/A')
    event_time = event_data.get('openDate', 'N/A')

    # Then the competition name and ID
    competition_name = match_data['attachments']['competitions']
    keys_list = list(competition_name)
    competition_data = competition_name.get(keys_list[0], {})
    competition_name_new = competition_data.get('name', 'N/A')
    competition_id = competition_data.get('competitionId', 'N/A')
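The pattern above (list the keys, take the first one, then chain .get with a default) is a defensive way to reach a single nested object without risking a KeyError; a minimal illustration on invented toy data:

```python
# Toy stand-in for match_data['attachments']['competitions']
competitions = {"10932509": {"name": "Premier League", "competitionId": "10932509"}}

keys_list = list(competitions)                       # ['10932509']
competition_data = competitions.get(keys_list[0], {})

print(competition_data.get("name", "N/A"))     # Premier League
print(competition_data.get("country", "N/A"))  # N/A -- missing key falls back
```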
Still inside the loop, we iterate over the market keys to pull out each market's name and its runners (the individual selections):
    markets = match_data['attachments']['markets'].keys()
    for j in markets:
        markets_new = match_data['attachments']['markets'][j]['marketName']
        selections = match_data['attachments']['markets'][j]['runners']
        for k in selections:
            fixtures.append(event_name)
            competitions.append(competition_name_new)
            market_types.append(markets_new)
            match_dates_and_time.append(event_time)
            match_ids.append(i)
            name = k['runnerName']
            market_names.append(name)
            current_time = datetime.now()
            time_scraped.append(current_time)
            print(name)
            try:
                odds = k['winRunnerOdds']['trueOdds']['decimalOdds']['decimalOdds']
            except KeyError:
                odds = 'Issue with Odds'
            odds_matches.append(odds)
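The try/except is needed because not every runner carries priced odds: if any level of that nested path is missing, Python raises a KeyError and we record a placeholder instead. A small self-contained illustration (the runner dictionaries are invented, mirroring the structure used above):

```python
# One runner with the full nested odds path, one without
runner_priced = {"winRunnerOdds": {"trueOdds": {"decimalOdds": {"decimalOdds": 2.5}}}}
runner_unpriced = {"runnerName": "Draw"}

def extract_odds(runner):
    # Fall back to a placeholder when any level of the path is absent
    try:
        return runner["winRunnerOdds"]["trueOdds"]["decimalOdds"]["decimalOdds"]
    except KeyError:
        return "Issue with Odds"

print(extract_odds(runner_priced))    # 2.5
print(extract_odds(runner_unpriced))  # Issue with Odds
```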
Finally, we will create a DataFrame to store all the collected data:
columns = ['Match ID', 'Fixture', 'Match Time and Date', 'Competition', 'Market Type', 'Market Name', 'Market Odds', 'Time Scraped']
# Initialize a new DataFrame with columns
new_dataframe = pd.DataFrame(columns=columns)
# Add arrays to columns
new_dataframe['Match ID'] = match_ids
new_dataframe['Fixture'] = fixtures
new_dataframe['Match Time and Date'] = match_dates_and_time
new_dataframe['Competition'] = competitions
new_dataframe['Market Type'] = market_types
new_dataframe['Market Name'] = market_names
new_dataframe['Market Odds'] = odds_matches
new_dataframe['Time Scraped'] = time_scraped
new_dataframe
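If you want to persist the table rather than just display it, one option (not covered in the original steps) is to write the DataFrame to CSV; the tiny DataFrame below is a stand-in for the scraped one:

```python
import pandas as pd

# Stand-in for the scraped DataFrame, with invented values
new_dataframe = pd.DataFrame({
    "Fixture": ["Arsenal v Chelsea"],
    "Market Odds": [2.5],
})

# index=False keeps the row index out of the file
new_dataframe.to_csv("live_odds.csv", index=False)
```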
The output is a DataFrame with one row per selection, containing the match ID, fixture, kick-off time, competition, market type, runner name, decimal odds, and the time scraped.
As always, feel free to reach out with any questions or feedback on this article. If you found it helpful, please give it a clap and follow for more!
Chapter 2: Video Resources
Explore how to scrape JSON data embedded in SCRIPT tags in this tutorial.
Learn to scrape live scores without the need for BeautifulSoup or Selenium.