
Google Search Console Data Fetcher Python Tool

This tool was built for three main reasons:

1. For historic data – back up all data for all verified GSC properties.

2. For future data – automatically add new data on a daily basis.


[Screenshot: Google Search Console dashboard]

The big advantage of fetching the data daily is that it lets us save as much data as possible, broken down by different segments for each individual day, which in turn lets us make more accurate decisions based on a larger amount of data.


3. Perhaps the most challenging one: build an SEO dashboard for data analysis, with data taken from multiple sources, blended together:


a. Google Search Console

b. Google Analytics

c. Log files, using my Log File Analyzer Python tool

d. SEO crawlers (e.g. Screaming Frog, Deep Crawl, etc.)


Therefore, this tool is only the first step towards the final goal, which is to build an SEO dashboard for data analysis using Microsoft Power BI and/or Excel with Power Query.


So, let’s dive into the code. I’ve documented it in as much detail as I can, to help Python beginners like myself understand it and actually use this tool.


GSC API Setup:


1. Enable Search Console API


First you need to create a project on Google Cloud Platform. I won’t go into details here, as this is straightforward and out of the scope of this project, and Google’s documentation covers it well.


Once you have set up a project, follow the steps below:


APIs & Services → Credentials → Create credentials → OAuth 2.0 client ID → Other → Create.


[Screenshots: creating the OAuth 2.0 client ID in Google Cloud Platform]

Download the JSON file and place it under the ‘appengine’ folder in your Python project. The JSON file will look like this, with your own YOUR_CLIENT_ID, YOUR_PROJECT_ID and YOUR_CLIENT_SECRET:


[Screenshot: the client secret JSON file]
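
For reference, a client secret file for an installed (desktop) app generally follows this structure – the values below are placeholders:

{
  "installed": {
    "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",
    "project_id": "YOUR_PROJECT_ID",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "YOUR_CLIENT_SECRET",
    "redirect_uris": ["urn:ietf:wg:oauth:2.0:oob", "http://localhost"]
  }
}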

2. Install Google Client Library

From the command line: pip install --upgrade google-api-python-client (assuming you have already installed pip). The code below also relies on google-auth-oauthlib, pandas, openpyxl and pytz, so install those as well if needed.


Project Structure, Files and Code:


Two main Python files are used in this project.


1. main.py

This is the main Python file that runs the GSC Data Fetcher tool.


[Screenshot: main.py]
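
The snippets below assume a handful of imports at the top of main.py. They aren’t shown here, so treat this as a likely sketch (exact module paths may vary with your library versions):

import pickle

import pandas as pd
from pandas import json_normalize  # in older pandas: from pandas.io.json import json_normalize

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow

import config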

Functions:


main() function

All of the tool’s other functions are run from this function, except append_df_to_excel(), which is called from within the write_file() function.

1. Authorize access to Google Search Console API with authorize_access() function.

2. Retrieve list of verified properties in account with read_site_list() function.

3. Query data for those verified websites with query_data() function.

4. Save all data to an Excel file, per website.


def main():
    webmaster = authorize_access()
    read_site_list(webmaster)

    for site, site_url in config.site_list.items():
        gsc_data_date, gsc_data_page_date, gsc_data_query_date, gsc_data_device, gsc_data_page_query_date = query_data(webmaster, site_url)
        write_file(site, gsc_data_date, gsc_data_page_date, gsc_data_query_date, gsc_data_device,
                   gsc_data_page_query_date)
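
For completeness, main.py would also end with the usual entry-point guard (not shown above), so that running python main.py kicks everything off:

if __name__ == '__main__':
    main()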

authorize_access() function

Authorize access to the Google Search Console API with the authorize_access() function, using the client credentials and caching them in a pickle file so the OAuth flow only has to run once.


def authorize_access():
    # Reuse cached credentials if they exist; otherwise run the OAuth flow once
    # and cache the resulting credentials for future runs.
    try:
        credentials = pickle.load(open("appengine/credentials.pickle", "rb"))
    except (OSError, IOError):
        flow = InstalledAppFlow.from_client_secrets_file(config.CLIENT_SECRET_JSON, scopes=config.OAUTH_SCOPE)
        credentials = flow.run_console()
        pickle.dump(credentials, open("appengine/credentials.pickle", "wb"))

    # Build the Search Console ('webmasters') API service object.
    webmasters_service = build('webmasters', 'v3', credentials=credentials)
    return webmasters_service

read_site_list() function

Retrieve the list of properties in the account and keep only the verified ones, with the read_site_list() function.


def read_site_list(webmasters_service):
    # Retrieve list of properties in account
    site_list = webmasters_service.sites().list().execute()

    # Filter for verified websites
    verified_sites_urls = [s['siteUrl'] for s in site_list['siteEntry']
                           if s['permissionLevel'] != 'siteUnverifiedUser'
                           and s['siteUrl'][:4] == 'http']

    # Save the URLs of all websites you are verified for.
    site_list = []
    for site_url in verified_sites_urls:
        site_list.append(site_url)
    return site_list

query_data() function

For each site in the list – query data by date / page / query / device and combinations of those with the query_data() function. For each data set we do the following:


1. Query data from webmasters_service.searchanalytics(), using two parameters:

a) siteURL

b) body – load the body with the JSON query taken from the config.py file, e.g. config.GSC_QUERY_DATE, which looks like this in config.py:


GSC_QUERY_DATE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "date"
    ],
    "rowLimit": 5000
}

2. Normalize the JSON response to read all fetched rows from the query, using the json_normalize() function.


3. Split the nested JSON (i.e. "dimensions": ["query", "date"]) held in the 'keys' column from one column into as many columns as needed – in this case two. This avoids keeping the whole response in a single column, which would make the data hard to analyze, and is done with gsc_data_date['keys'].apply(pd.Series).fillna(0).


4. Combine those two new columns with the rest of the data set using the pd.concat() function.


def query_data(webmasters_service, site_url):
    # Query all data
    gsc_data_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_DATE).execute()
    gsc_data_date = json_normalize(gsc_data_date['rows'])
    df_keys__date = gsc_data_date['keys'].apply(pd.Series).fillna(0)
    gsc_data_date = pd.concat([gsc_data_date, df_keys__date], axis=1)

    gsc_data_page_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_PAGE_DATE).execute()
    gsc_data_page_date = json_normalize(gsc_data_page_date['rows'])
    df_keys_page_date = gsc_data_page_date['keys'].apply(pd.Series).fillna(0)
    gsc_data_page_date = pd.concat([gsc_data_page_date, df_keys_page_date], axis=1)

    gsc_data_query_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_QUERY_DATE).execute()
    gsc_data_query_date = json_normalize(gsc_data_query_date['rows'])
    df_keys_query_date = gsc_data_query_date['keys'].apply(pd.Series).fillna(0)
    gsc_data_query_date = pd.concat([gsc_data_query_date, df_keys_query_date], axis=1)

    gsc_data_device = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_DEVICE).execute()
    gsc_data_device = json_normalize(gsc_data_device['rows'])
    df_keys_gsc_data_device = gsc_data_device['keys'].apply(pd.Series).fillna(0)
    gsc_data_device = pd.concat([gsc_data_device, df_keys_gsc_data_device], axis=1)

    gsc_data_page_query_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_QUERY_DATE_PAGE).execute()
    gsc_data_page_query_date = json_normalize(gsc_data_page_query_date['rows'])
    df_keys_page_query_date = gsc_data_page_query_date['keys'].apply(pd.Series).fillna(0)
    gsc_data_page_query_date = pd.concat([gsc_data_page_query_date, df_keys_page_query_date], axis=1)

    return gsc_data_date, gsc_data_page_date, gsc_data_query_date, gsc_data_device, gsc_data_page_query_date
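
One caveat worth noting (my own observation, not handled in the original code): if a property has no data for the chosen day, the API response contains no 'rows' key and json_normalize() will raise an error. A small, hypothetical helper along these lines would make query_data() more robust:

def rows_to_df(response):
    # Return an empty DataFrame when the API returned no rows for that day.
    if 'rows' not in response:
        return pd.DataFrame()
    rows = json_normalize(response['rows'])
    # Split the nested 'keys' list into separate columns, as described above.
    keys = rows['keys'].apply(pd.Series).fillna(0)
    return pd.concat([rows, keys], axis=1)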


write_file() function

Save all data sets to an Excel file, one data set per worksheet, with the write_file() function.

1. Set 'keys' as the index, i.e. df_gsc_data_date = pd.DataFrame(gsc_data_date).set_index('keys')


2. Set the column names, i.e. df_gsc_data_date.columns = ['Clicks', 'CTR', 'Impressions', 'Position', 'Date']


3. Drop null values to avoid errors, with df_gsc_data_page_query_date.dropna(inplace=True)


4. Save all data sets to an Excel file (other than ‘Data by Device’ which requires a different implementation if we’d like to run this file automatically and append data).


a) index=False is to remove the index ('keys' in this case) from the final output, as we no longer need it once the column has been split, as mentioned above.


b) header=False is to prevent the header row from being added each time the script runs.

Now it’s a bit tricky here, I must say.


I need to find a way (probably with a simple ‘if’ statement) to write the headers only the first time the script runs, and skip them on all subsequent runs, but I haven’t done that yet.


So, for now you’ll have to set header=True on the first run, to get the titles, and then switch back to header=False to get it right (one possible fix is sketched below).
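
That possible fix (an assumption on my part, not part of the current code) is to check whether the workbook already exists and set header accordingly:

import os

# Hypothetical tweak inside write_file(): write the header row only on the very
# first run, i.e. before the workbook has been created.
first_run = not os.path.isfile(filename)
append_df_to_excel(filename, df_gsc_data_date, sheet_name='Data by Date',
                   index=False, header=first_run)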



def write_file(site, gsc_data_date, gsc_data_page_date, gsc_data_query_date, gsc_data_device, gsc_data_page_query_date):
    filename = 'data/' + site + '-' + 'gsc-data-backup.xlsx'

    df_gsc_data_date = pd.DataFrame(gsc_data_date).set_index('keys')
    df_gsc_data_date.columns = ['Clicks', 'CTR', 'Impressions', 'Position', 'Date']

    df_gsc_data_page_date = pd.DataFrame(gsc_data_page_date).set_index('keys')
    df_gsc_data_page_date.columns = ['Clicks', 'CTR', 'Impressions', 'Position', 'Page', 'Date']

    df_gsc_data_query_date = pd.DataFrame(gsc_data_query_date).set_index('keys')
    df_gsc_data_query_date.columns = ['Clicks', 'CTR', 'Impressions', 'Position', 'Query', 'Date']

    df_gsc_data_device = pd.DataFrame(gsc_data_device).set_index('keys')
    df_gsc_data_device.columns = ['Clicks', 'CTR', 'Impressions', 'Position', 'Platform']

    df_gsc_data_page_query_date = pd.DataFrame(gsc_data_page_query_date).set_index('keys')
    df_gsc_data_page_query_date.columns = ['Clicks', 'CTR', 'Impressions', 'Position', 'Query', 'Date', 'Page']

    # dropping rows with null values to avoid errors
    df_gsc_data_page_query_date.dropna(inplace=True)

    append_df_to_excel(filename, df_gsc_data_date, sheet_name='Data by Date', index=False, header=False)
    append_df_to_excel(filename, df_gsc_data_page_date, sheet_name='Data by Page Date', index=False, header=False)
    append_df_to_excel(filename, df_gsc_data_query_date, sheet_name='Data by Query Date', index=False, header=False)
    append_df_to_excel(filename, df_gsc_data_page_query_date, sheet_name='Data by Query Date Page', index=False, header=False)
    # append_df_to_excel(filename, df_gsc_data_device, sheet_name='Data by Device', index=False, header=False)

append_df_to_excel() function

This function tries to open an existing workbook. If one doesn’t exist, it is created automatically and the data is written starting at row zero (startrow = 0).


If the Excel file does exist, the data is appended just below the last populated row found in the sheet, using startrow = writer.book[sheet_name].max_row.
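
Here is a minimal sketch of append_df_to_excel() based on the behaviour described above. It assumes openpyxl is installed and a reasonably recent pandas (the if_sheet_exists argument needs pandas 1.4+), so the original helper may differ in the details:

import os

import pandas as pd


def append_df_to_excel(filename, df, sheet_name='Sheet1', **to_excel_kwargs):
    # First run: the workbook doesn't exist yet, so create it and write from row zero.
    if not os.path.isfile(filename):
        df.to_excel(filename, sheet_name=sheet_name, startrow=0, **to_excel_kwargs)
        return

    # Later runs: open the existing workbook in append mode and write the new rows
    # just below the last populated row of the target sheet.
    with pd.ExcelWriter(filename, engine='openpyxl', mode='a', if_sheet_exists='overlay') as writer:
        if sheet_name in writer.book.sheetnames:
            startrow = writer.book[sheet_name].max_row
        else:
            startrow = 0
        df.to_excel(writer, sheet_name=sheet_name, startrow=startrow, **to_excel_kwargs)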



The file will be created / saved under the /data folder, named [site name]-gsc-data-backup.xlsx, and will look like this:


[Screenshot: the resulting Excel workbook with GSC data]


2. config.py file

The config file contains the following:


1. A list of verified sites (site_list = {'site': 'site_url'}) to fetch data from. Add as many as you want.


2. Set the date range for data fetching with START_DATE, END_DATE and OFFSET_DATE. Depending on your purpose (e.g. backing up data or running automatically every day), you can alter those accordingly.


3. START_DATE = END_DATE is to ensure I’m fetching data for only one day.


4. OFFSET_DATE = 4 ensures we’re fetching data from 4 days ago (to align with GSC’s roughly 3-day data delay, plus one day to stay on the safe side due to different time zones).


5. Define the access scope with OAUTH_SCOPE = ('https://www.googleapis.com/auth/webmasters.readonly', 'https://www.googleapis.com/auth/webmasters').


6. Redirect URI for installed apps; REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob' – not sure this is even necessary TBH.


7. CLIENT_SECRET_JSON = 'YOUR_CLIENT_SECRET.json' – the path to the client secret JSON file. Change it to your own file name.


8. Five queries to pull segmented data (sorry for the horrible naming here).

a. GSC_QUERY_DATE : fetch data by date

b. GSC_QUERY_PAGE_DATE : fetch data by date-page

c. GSC_QUERY_QUERY_DATE : fetch data by query-date

d. GSC_QUERY_DEVICE : fetch data by device (desktop, mobile).

e. GSC_QUERY_QUERY_DATE_PAGE : fetch data by query-date-page.


The first four are simpler and can be used for high-level reports; however, I’m most interested in fetching data by query, date and page (the fifth one) – to be able to analyze traffic and rankings for each page and query, by date.


from datetime import datetime, timedelta
import pytz

site_list = {'site': 'site_url'}

# Since GSC data is delayed, specify the offset (in days) from today.
OFFSET_DATE = 4
# Set Timezone ('US/Eastern', 'US/Central', 'US/Pacific', 'Europe/London')
GSC_TIMEZONE = 'US/Pacific'

END_DATE = (datetime.now(pytz.timezone(GSC_TIMEZONE)) - timedelta(days=OFFSET_DATE)).strftime(
    "%Y-%m-%d")  # Returns the GSC-format date based on OFFSET_DATE

START_DATE = END_DATE

OAUTH_SCOPE = ('https://www.googleapis.com/auth/webmasters.readonly', 'https://www.googleapis.com/auth/webmasters')

# Redirect URI for installed apps
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'

CLIENT_SECRET_JSON = 'YOUR_CLIENT_SECRET.json'

GSC_QUERY_DATE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "date"
    ],
    "rowLimit": 5000
}

GSC_QUERY_PAGE_DATE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "page",
        "date"
    ],
    "rowLimit": 5000
}

GSC_QUERY_QUERY_DATE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "query",
        "date"
    ],
    "rowLimit": 5000
}

GSC_QUERY_DEVICE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "device"
    ],
    "rowLimit": 5000
}

GSC_QUERY_QUERY_DATE_PAGE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "query",
        "date",
        "page"
    ],
    "rowLimit": 5000
}

3. Run the tool automatically with a Bat File


Create a .bat file that is responsible for running main.py, place it under your main project folder, and schedule it to run daily (e.g. with Windows Task Scheduler). This is easy to do with the many online guides on the topic and takes about two minutes.
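
For illustration, such a .bat file could be as simple as the lines below (the folder path is a placeholder – adjust it to your own project location), scheduled to run once a day with Task Scheduler:

@echo off
cd /d "C:\path\to\gsc-data-fetcher"
python main.py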


Next Step


In the next step we’ll load this daily-updated file into Microsoft Power BI, or Excel with Power Query, and build a dashboard where you can slice and dice all the data as you wish. Here’s a snapshot of how it looks:


[Screenshot: GSC dashboard]

Link to the project on GitHub – https://github.com/napo7890/gsc-data-fetcher

I hope you’ll find this tool useful; any feedback will be much appreciated.
