In a meeting I had with a content team a few weeks back, I realized they have a serious problem keeping track of competitors’ prices and updating the content on their site whenever competitors or providers change their prices and deals.
They mentioned that it takes a lot of time and manual work to go over so many competitors’ websites and pages, week after week, and manually check for price changes.
As a result, they are struggling to ensure that the content on the site is always accurate for their readers and customers.
Whether you work in-house, or you own the entire damn house (quote credit to Traffic Think Tank), you wouldn’t want your users to find out that you’re providing them with the wrong information.
This is extremely bad in terms of E-A-T: you should assume that users will lose their trust in you, which will hurt how they perceive your expertise and authoritativeness in your field.
This talk made me think about how I could automate this process and kill two birds with one stone:
1. Ensure content accuracy for the users, especially when it comes to prices (YMYL, you know…).
2. Help the content team work more efficiently, by enabling them to spend their time on the things that matter to them: content strategy, research, writing and so forth.
Who Should Use This Tool?
Well, anyone can use it of course, but a few use cases I have in mind are:
SEO & Product teams – can use it for competitor research, A/B price testing, and pricing strategy.
Content teams – can use it to update content on their sites with the right prices in case this process is not already automated, as in this case.
Bizdev / affiliates – can use it to negotiate better prices.
So, let’s dive now into the code, to learn how to scrape & compare prices and alert on price changes via email, with Python. These are the main steps we will cover in this column:
Save URLs to a CSV file
Check if URLs are valid
Scrape prices from the HTML files of those pages.
Compare the saved prices to the current (scraped) prices
Send an email alert if prices were changed
Write data to file
Use Python Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import smtplib
import config
import re
import numpy as np
import pathlib
Create a URL List & Validate URLs
First, we’ll create a CSV file named ‘input.csv’ and place it under the main project folder. This file holds the list of URLs we’d like to scrape, for example our competitors’ pages.
The get_urls_from_file() function reads this file to fetch those URLs, so other functions can use them later.
In get_urls_from_file() we take the path to the input.csv file on our local machine from config.py (config.URL_SOURCE_FILE), which keeps the code here clean and easy to read. You will also find other settings in config.py that we use later, such as the email configuration, the user agent and more.
Next, we read the CSV file with the pandas library, loop over the URLs in it, and store the valid ones in the url_list list, which is returned to the functions that call it.
def get_urls_from_file():
    url_list = []
    urls_file_path = config.URL_SOURCE_FILE
    df_urls = pd.read_csv(urls_file_path, header=None)
    # Keep only valid URLs (each row is a one-element array, read later as url[0])
    for url in df_urls.values:
        if is_valid_url(str(url)):
            url_list.append(url)
    return url_list
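To see what the loop above iterates over, here is a minimal sketch of how pd.read_csv(..., header=None) exposes a one-column file. The URLs and the in-memory CSV are made up for illustration. Each row of .values is a one-element array, which is presumably why the scraping code later reads url[0].

```python
import io

import pandas as pd

# Hypothetical stand-in for input.csv: one URL per line, no header row
csv_text = "https://www.example.com/pricing\nhttps://shop.example.org/deals\n"

df_urls = pd.read_csv(io.StringIO(csv_text), header=None)

rows = df_urls.values        # 2-D array: one row per line, one column
urls = df_urls[0].tolist()   # flat list of plain strings

print(rows[0])   # a one-element array holding the first URL
print(urls)
```

If you prefer to work with plain strings rather than array rows, taking the first column directly (as urls does here) is one option.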
Additionally, we’d like to ensure that all URLs are valid, to prevent the tool from breaking if one of the URLs is malformed. The is_valid_url(str) function compiles a simple regex with the re library and returns a Boolean, so either True or False, right?
def is_valid_url(str):
    # Regex to check valid URL
    regex = ("((http|https)://)(www.)?" +
             "[a-zA-Z0-9@:%._\\+~#?&//=]" +
             "{2,256}\\.[a-z]" +
             "{2,6}\\b([-a-zA-Z0-9@:%" +
             "._\\+~#?&//=]*)")
    # Compile the regex
    p = re.compile(regex)
    # If the string is missing return False
    if str is None:
        return False
    # Return whether the string matched the regex
    if re.search(p, str):
        return True
    else:
        return False
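As a quick way to check how this validation behaves, the sketch below applies the same pattern to a few sample strings. The looks_like_url name and the samples are assumptions for illustration, not part of the original tool. Note that re.search() looks for the pattern anywhere in the string, so a URL wrapped in brackets and quotes (what str() returns for an array row) also passes.

```python
import re

# Same pattern as above, written as raw strings
URL_PATTERN = re.compile(
    r"((http|https)://)(www.)?"
    r"[a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b"
    r"([-a-zA-Z0-9@:%._\+~#?&//=]*)"
)


def looks_like_url(candidate):
    """Return True when the candidate contains an http(s) URL."""
    if not isinstance(candidate, str):
        return False
    return URL_PATTERN.search(candidate) is not None


samples = [
    "https://www.example.com/pricing",
    "['https://www.example.com/pricing']",  # str() of an array row
    "example.com",                          # no scheme
    "not a url",
    None,
]
results = [looks_like_url(sample) for sample in samples]
print(results)
```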
Scrape Prices & Save to File
Next, in the main() function, which runs all the other functions in this script, we use the pathlib library to define the path to the ‘saved-prices.csv’ file, where the scraped prices are stored.
def main():
    file = pathlib.Path('saved-prices.csv')
What is ‘saved-prices.csv’ file?
This file holds the prices scraped from the URLs we chose, and it is created when we run the write_data() function, which triggers get_scraped_prices() to scrape the current prices (more on those functions later).
When you run the script for the first time, this CSV is created and saved, so from then on we can compare it against the newly scraped prices.
Later, if prices have changed when we scrape the pages again, we save the ‘new’ scraped prices to this file to keep it up to date for future runs.
Now, if this file already exists in our folder, which means we have already run the script before, we can skip this initial write_data() call and move on to the other functions.
def main():
    file = pathlib.Path(config.PRICES_File_PATH)
    if not file.exists():
        write_data()
    else:
        compare_prices()
        send_email()
        write_data()
Scrape Prices from Competitors’ Pages
To scrape prices we use the get_scraped_prices() function mentioned above. It is triggered from within the compare_prices() and write_data() functions, which is why it does not appear in the main() function above.
So let’s talk about this get_scraped_prices() function. To scrape the current prices from the HTML code of our competitors’ pages, we use BeautifulSoup library with an HTML parser.
1. We trigger the get_urls_from_file() function to get the URL list and save it to the urls_list variable.
2. Then, we read the user agent defined in config.py and save it in the headers variable, so it can be sent along with each page request.
def get_scraped_prices():
    urls_list = get_urls_from_file()
    headers = config.USER_AGENT
    scraped_prices_dict = {}
In this case we chose a Chrome user agent, and here’s this line in config.py:
USER_AGENT = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}
3. We also create a new dictionary called scraped_prices_dict{} to hold all prices we will soon scrape from pages, and deliver them later to other functions.
4. Then, we loop over each URL in the list and perform the following actions:
A) Scrape HTML with BeautifulSoup library.
B) Fetch the text from some HTML elements, where prices might be found.
    for url in urls_list:
        scraped_prices = []
        page = requests.get(url[0], headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        price = soup.find_all(['class', 'h1', 'h2', 'span', 'div', 'a', 'title', 'del', 'a', 'p'], text=re.compile(r'\$'))
C) Use Regex to filter only the prices with a dollar sign ($) from the text in the elements above.
D) Save the URLs with their corresponding prices to the scraped_prices_dict{} dictionary we created above, so we can use it in the next function, which will compare old to new prices.
        # Keep only the dollar amounts found in the matched elements
        dollars = []
        for x in re.findall(r'(\$[0-9]+(\.[0-9]+)?)', str(price)):
            dollars.append(x[0])
        # Strip the dollar sign and keep the digits
        price_digit = []
        for x in re.findall(r'([0-9]+(\.[0-9]+)?)', str(dollars)):
            price_digit.append(x[0])
        # Remove duplicates, convert to floats and sort
        price_digit_unique = set(price_digit)
        for price in price_digit_unique:
            price = float(price)
            scraped_prices.append(price)
        scraped_prices.sort()
        scraped_prices_dict.update({str(url): scraped_prices})
    return scraped_prices_dict
Here’s how the dictionary will look if you print it in console:
{
    "['https://www…']": [6.99, 20, 50, 80],
    "['https://...com.']": [89, 286.8],
    "['https://www….']": [],
    "['https://….com']": [20.0, 60.0],
    "['https://www….com']": [2.88, 3.99, 4.5, 4.92]
}
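To make steps A to D easier to follow, here is a small self-contained sketch that runs the same idea against an inline HTML snippet instead of a live page. The HTML, the tag list and the variable names are made up for illustration, and it uses the string= argument, which newer BeautifulSoup versions prefer over text=.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page content standing in for a competitor's pricing page
html = """
<html><body>
  <h1>Plans</h1>
  <span class="price">$6.99</span>
  <p>Pro plan: <del>$25</del></p>
  <div>Now only $20 per month</div>
  <p>No price here</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# A tag matches only when its own text (tag.string) contains a dollar sign,
# so the <del> tag matches here but its parent <p> does not
elements = soup.find_all(['span', 'del', 'div', 'p'], string=re.compile(r'\$'))
texts = [element.get_text() for element in elements]

# Pull the numeric part of each dollar amount, dedupe, and sort
amounts = set()
for text in texts:
    for number in re.findall(r'\$([0-9]+(?:\.[0-9]+)?)', text):
        amounts.add(float(number))
prices = sorted(amounts)

print(texts)
print(prices)
```

Using a non-capturing group, (?:...), lets re.findall() return the number directly, so the second regex pass over the dollar strings is not needed in this variant.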
Please notice that the code is limited to US dollars only ($) appearing on the left side of the price digits, and only on those specific HTML elements. You can create your own regex to include all currencies, or even use advanced tools such as NLTK to make this more robust.
By using the regex above, I chose to keep this code simple, mostly for those of you who are just starting with Python, and would like to get this tool up and running, rather than dealing with more advanced code.
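For readers who do want to go a step further, the sketch below is one possible way to widen the regex to a few currency symbols. The symbol set, the extract_prices name and the sample text are assumptions made for illustration. It still only handles a symbol on the left of the digits, with a comma as the thousands separator and a dot as the decimal separator.

```python
import re

# Currency symbol, optional space, then digits with optional commas and decimals
PRICE_PATTERN = re.compile(r'([$€£])\s?([0-9][0-9,]*(?:\.[0-9]+)?)')


def extract_prices(text):
    """Return (symbol, amount) pairs in the order they appear in the text."""
    prices = []
    for symbol, number in PRICE_PATTERN.findall(text):
        # Drop thousands separators (and any trailing comma) before converting
        prices.append((symbol, float(number.replace(',', ''))))
    return prices


found = extract_prices('Basic $6.99, Pro €20 and Team £1,299.00')
print(found)
```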
Compare Saved Prices to Current Prices
Now that we have all the data in hand, meaning both the prices we scraped in the first run (and saved in saved-prices.csv) and the current prices we scraped later in a second run, we can compare the two data sets to find any price discrepancies. Here’s how we do it:
1. Get saved prices
We use pandas to read the prices from the saved-prices.csv path defined in config.py
def compare_prices():
    # Get saved prices from file
    prices_file_path = config.PRICES_File_PATH
    df_saved_prices = pd.read_csv(prices_file_path)
2. Get current prices
We call the get_scraped_prices() function, which returns a dictionary, take its keys (URLs) and values (prices), and build a data frame from them.
    # Get scraped prices (scrape once and reuse the result for keys and values)
    scraped_prices = get_scraped_prices()
    prices_values = list(scraped_prices.values())
    price_keys = list(scraped_prices.keys())
    df_scraped_prices = pd.DataFrame.from_dict(prices_values).transpose().fillna(0).reset_index(drop=True)
    df_scraped_prices.columns = price_keys
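Since each URL can have a different number of prices, it may help to see how such a dictionary turns into a rectangular data frame. The sketch below is a variation on the step above, using orient='index' so the URLs become column names directly. The URLs and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical scrape result: URLs as keys, sorted price lists as values
scraped = {
    'https://a.example/pricing': [6.99, 20.0, 50.0],
    'https://b.example/deals': [89.0],
}

# One row per URL first, then transpose so each URL becomes a column;
# shorter price lists are padded with NaN, which fillna(0) replaces with 0
df = (
    pd.DataFrame.from_dict(scraped, orient='index')
    .transpose()
    .fillna(0)
    .reset_index(drop=True)
)

print(df)
```

Padding with 0 keeps the frame rectangular, at the cost that a genuinely missing price and a price of 0 look the same.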
3. Compare saved vs. current prices
We run a comparison of those 2 data frames (df_saved_prices and df_scraped_prices) to identify where data prices are different (!=) and stack the results.
For each change we use numpy library to get the URL and its two prices; saved (changed_from) and current (changed_to).
    # Compare saved prices to scraped prices
    ne_stacked = (df_saved_prices != df_scraped_prices).stack()
    # Keep only the cells that differ (an empty result if nothing changed)
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['ID', 'URL']
    # Look up the old and new values at the positions that differ
    difference_locations = np.where(df_saved_prices != df_scraped_prices)
    changed_from = df_saved_prices.values[difference_locations]
    changed_to = df_scraped_prices.values[difference_locations]
Then, we construct a data frame with all the price changes we have found above, per URL. You can also save it to a local csv file for any future reference, though it’s not a must.
    df_price_changes = pd.DataFrame({'Saved Price': changed_from, 'Scraped Price': changed_to}, index=changed.index)
    df_price_changes.to_csv('price-changes.csv', index=False, header=True, mode='w')
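Here is the comparison step in isolation, as a minimal sketch on two tiny data frames. The find_changes name, the URLs and the prices are assumptions for illustration, and the sketch assumes both frames have the same shape and the same column labels, since the != comparison needs them to line up. It builds the (ID, URL) index from the np.where() positions rather than from stack().

```python
import numpy as np
import pandas as pd


def find_changes(saved, scraped):
    """Return one row per cell whose value differs between the two frames."""
    diff = (saved != scraped).to_numpy()
    rows, cols = np.where(diff)  # positions of the differing cells
    index = pd.MultiIndex.from_arrays(
        [saved.index[rows], saved.columns[cols]], names=['ID', 'URL']
    )
    return pd.DataFrame(
        {
            'Saved Price': saved.to_numpy()[rows, cols],
            'Scraped Price': scraped.to_numpy()[rows, cols],
        },
        index=index,
    )


# Hypothetical data: one price changed from 20.0 to 25.0
saved = pd.DataFrame({'https://a.example/pricing': [6.99, 20.0],
                      'https://b.example/deals': [89.0, 0.0]})
scraped = pd.DataFrame({'https://a.example/pricing': [6.99, 25.0],
                        'https://b.example/deals': [89.0, 0.0]})

changes = find_changes(saved, scraped)
alerts = [
    f'A price value on page {url} has been changed '
    f'from ${float(row["Saved Price"])} to ${float(row["Scraped Price"])}'
    for (_, url), row in changes.iterrows()
]
print(alerts)
```

When nothing differs, find_changes() returns an empty frame and alerts is an empty list.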
4. Build a list of URLs & price changes
Next, we iterate over each row in our df_price_changes with itertuples(), and format it into a message string (msg) that includes the URL, the prices that were changed for this URL, and some text surrounding it.
We save it to a list (alerts in this case), which will be used later in our email sending function, to alert on all price changes we have found.
    alerts = []
    for row in df_price_changes.itertuples(df_price_changes.index.names):
        url = row[0][1]
        prices = row[1:]
        msg = f'A price value on page {url} has been changed from ${prices[0]} to ${prices[1]}'
        alerts.append(msg)
    return alerts
Send Email Alerts via Gmail
We will use the Gmail server to send emails, again for simplicity and fast execution, but this can be done in other ways, such as using Amazon SES, for example.
First we call the compare_prices() function and check if it returns any data (price changes) before we trigger the email sending, as we’d like to do that only if we find prices that were changed, right?
def send_email():
    alerts = compare_prices()
If we get a list back from this function, we will take the following steps to send the email alert:
1. Print to console
Print ‘Sending email…’ to the console, so you’ll know the process is working and running.
    if alerts:
        print('Sending email...')
2. Configure Gmail settings
In our config.py file we set all the relevant Gmail server parameters:
Server location or IP – smtp.gmail.com
Port – 587
Email from
Email Password
Email to
If you’d like to use another email service – you’ll have to find its own settings and update the config.py file accordingly.
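For orientation, a config.py along these lines would cover the settings referenced throughout this column. This is only a sketch: the setting names are the ones used in the code here, while the file names, the email addresses and the PRICE_ALERT_EMAIL_PASSWORD environment variable are placeholders I made up for illustration.

```python
import os

# Files used by the tool (placeholder paths, relative to the project folder)
URL_SOURCE_FILE = 'input.csv'
PRICES_File_PATH = 'saved-prices.csv'

# Header sent with every page request
USER_AGENT = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/79.0.3945.88 Safari/537.36"}

# Gmail SMTP settings
EMAIL_SERVER = 'smtp.gmail.com'
EMAIL_PORT = 587
EMAIL_FROM = 'alerts@example.com'        # placeholder sender
EMAIL_TO = 'content-team@example.com'    # placeholder recipient
# Hypothetical environment variable, so the password stays out of the code
EMAIL_PASSWORD = os.environ.get('PRICE_ALERT_EMAIL_PASSWORD', '')
```

Reading the password from the environment is a design choice rather than a requirement; a plain string works too, as long as the file is not shared or committed.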
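Now, to be able to send the email from Gmail, we first need to allow ‘less secure apps’ on the Gmail account. Use this link https://www.google.com/settings/security/lesssecureapps and turn it on. Google has since retired this setting for most accounts; where it is no longer available, enable 2-Step Verification and generate an app password to use as the email password instead.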
Then, going back to main.py file, we call those parameters from config.py, and use smtplib library to set the server.
        # Get Email Settings
        email_from = config.EMAIL_FROM
        email_to = config.EMAIL_TO
        password = config.EMAIL_PASSWORD
        server = smtplib.SMTP(config.EMAIL_SERVER, config.EMAIL_PORT)
3. Create a message
We create a message as a string with the URLs and price changes we have found in the alerts list.
        # Create message
        msg = 'Hi Content team! \n\n' 'You are receiving this alert because the price for some of our competitors has been changed.' \
              '\n\n' 'See the summary below.\n\n' + '\n\n'.join(alerts)
        message = '\r\n'.join([
            'From: ' + email_from,
            'To: ' + email_to,
            'Subject: Price Changes Alert!!!',
            '',  # blank line separates the headers from the body
            msg
        ])
4. Start server & send message
First we run server.starttls(), which upgrades the connection to the Gmail server to an encrypted (TLS) one, so the email password is not sent in plain text.
Then we log in to the account used to send this alert with server.login(), send the email with server.sendmail(), print ‘Email alert sent’ to the console, and close the connection with server.quit() once the email has been sent.
        server.starttls()
        server.login(email_from, password)
        server.sendmail(email_from, [email_to], message)
        print('Email alert sent')
        server.quit()
Otherwise, if nothing gets returned from compare_prices() – it means no price changes were found and we print a message on our console to let you know that the code has completed its execution.
    else:
        print('No price changes were found')
And here’s what the email looks like. Not the best design, but it works.
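As an alternative to joining header strings by hand, the standard library’s email.message.EmailMessage can build the headers and body for you. The sketch below is one way the email steps might look with it. The function names, addresses and alert text are assumptions for illustration, and send_alert() is defined but not called, since it needs real credentials.

```python
import smtplib
from email.message import EmailMessage


def build_alert_message(alerts, email_from, email_to):
    """Build the alert email from a list of alert strings."""
    message = EmailMessage()
    message['From'] = email_from
    message['To'] = email_to
    message['Subject'] = 'Price Changes Alert'
    body = ('Hi Content team!\n\n'
            'You are receiving this alert because the price for some of '
            'our competitors has been changed.\n\n'
            'See the summary below.\n\n' + '\n\n'.join(alerts))
    message.set_content(body)
    return message


def send_alert(message, server, port, password):
    """Send the message over an encrypted connection (not run in this sketch)."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()                          # upgrade to TLS before logging in
        smtp.login(message['From'], password)
        smtp.send_message(message)


# Hypothetical alert and addresses
alerts = ['A price value on page https://a.example/pricing has been changed from $20.0 to $25.0']
message = build_alert_message(alerts, 'alerts@example.com', 'content-team@example.com')
print(message['Subject'])
```

The with block closes the connection automatically, so no explicit quit() call is needed in this variant.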
Write Prices to File
The last step is to use the write_data() function to write the data to our saved-prices.csv file.
This function will be executed in one of the following scenarios:
1. It’s the first time we run the tool and need to create this file for the first time, to be able to compare it against our scrape. Remember those two lines below from main() function?
    if not file.exists():
        write_data()
2. We already have this file from a previous run and would like to overwrite it with the new prices we scraped, to keep this file updated with the correct prices for the next run.
As previously, we call the get_scraped_prices() function, which returns a dictionary with URLs as keys and prices as values, convert it to a data frame using pd.DataFrame.from_dict() and save it to the ‘saved-prices’ CSV file.
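A minimal sketch of that step could look like the following. To keep it self-contained and testable, this version takes the prices dictionary and the target path as arguments, which differs from the tool’s own write_data(), and it writes to a temporary folder. The URLs and prices are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_data(prices_dict, target_path):
    """Save the scraped prices to CSV, one column per URL."""
    df = (
        pd.DataFrame.from_dict(prices_dict, orient='index')
        .transpose()
        .fillna(0)   # pad shorter price lists with 0
    )
    df.to_csv(target_path, index=False)


# Hypothetical scrape result
scraped = {
    'https://a.example/pricing': [6.99, 20.0],
    'https://b.example/deals': [89.0],
}

# Usage: write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'saved-prices.csv'
    write_data(scraped, target)
    round_trip = pd.read_csv(target)

print(round_trip)
```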
Access to Full Code
And this is it! We should now have an updated prices file, and we are ready to run the tool again, whenever we want.
You can also use a (.bat) file to run this Python script automatically on your local machine, on a daily/weekly/monthly basis, but this is out of scope for this column. Nevertheless, this is very easy to implement and you can find plenty of resources on how to do that online.
Here is a link to my full open source code of this Price Alert tool – https://github.com/napo7890/price-alert
Please let me know how this tool works for you, and help us make it even better, for the benefit of our entire SEO community :)