Google Search Console Data Fetcher Python Tool

This tool was built for three main reasons:

1. For historic data – back up all data for all verified GSC accounts.

2. For future data – automatically add new data on a daily basis.


Google Search Console Dashboard

The big advantage of fetching the data daily is that it lets us save as much data as possible, broken down by different segments for each individual day, which in turn lets us make more accurate decisions based on a larger amount of data.


3. Perhaps the most challenging one: build an SEO dashboard for data analysis, with data taken from multiple sources and blended together:


a. Google Search Console

b. Google Analytics

c. Log files, using my Log File Analyzer Python tool

d. SEO crawlers (e.g. Screaming Frog, Deep Crawl etc.)


Therefore, this tool is only the first step towards the final goal, which is to build an SEO dashboard for data analysis, using Microsoft Power BI and/or Excel with Power Query.


So, let’s dive into the code. I have detailed it as much as I can, to help Python beginners like myself understand it and actually use this tool.


GSC API Setup:


1. Enable the Search Console API


First, you need to create a project on Google Cloud Platform. I won’t go into detail here, as this is straightforward and out of the scope of this project, and Google’s documentation covers it well.


Once you have set up a project, follow the steps below:


APIs & Services → Credentials → Create credentials → OAuth 2.0 client ID → Other → Create.


Google Cloud Platform API

Google Cloud OAuth Client ID

Download the JSON file and place it in the ‘appengine’ folder of your Python project. The JSON file will look like this, with your own YOUR_CLIENT_ID, YOUR_PROJECT_ID, and YOUR_CLIENT_SECRET:


Client Secret ID
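
If the screenshot is hard to read, a client secret file for an installed (desktop) app typically has roughly this shape – the values below are placeholders, not real credentials:

{
  "installed": {
    "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",
    "project_id": "YOUR_PROJECT_ID",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "YOUR_CLIENT_SECRET",
    "redirect_uris": ["urn:ietf:wg:oauth:2.0:oob", "http://localhost"]
  }
}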

2. Install the Google Client Library

From the command line: pip install --upgrade google-api-python-client (assuming you have already installed pip). The OAuth flow in the code below also uses the google-auth-oauthlib package, so install it as well if it is not already present.


Project Structure, Files and Code:


Two main Python files are used in this project.


1. main.py

This is the main Python file that runs the GSC Data Fetcher tool.


Main Python File
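
The functions below rely on a handful of imports at the top of main.py. This is a sketch of the imports the code assumes – exact import paths may differ depending on your library versions:

import pickle

import pandas as pd
from pandas import json_normalize  # on older pandas versions: from pandas.io.json import json_normalize

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow

import config  # the project's config.py, holding the queries, dates and credential paths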

Functions:


main() function

All of the other functions in main.py are run from this function, except append_df_to_excel(), which is called from within the write_file() function. main() performs the following steps:

1. Authorize access to the Google Search Console API with the authorize_access() function.

2. Retrieve the list of verified properties in the account with the read_site_list() function.

3. Query data for those verified websites with the query_data() function.

4. Save all the data to an Excel file, one per website.


def main():
    # Authorize access to the GSC API and retrieve the verified properties
    webmaster = authorize_access()
    read_site_list(webmaster)

    # Query the data for every site in config.site_list and save it to Excel
    for site, site_url in config.site_list.items():
        gsc_data_date, gsc_data_page_date, gsc_data_query, gsc_data_device, gsc_data_page_query = query_data(webmaster, site_url)
        write_file(site, 'site_list', gsc_data_date, gsc_data_page_date, gsc_data_query, gsc_data_device,
                   gsc_data_page_query)
    return site, site_url
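
Not shown above, but main.py presumably ends with the usual entry-point guard so the tool can be run from the command line:

if __name__ == '__main__':
    main()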

authorize_access() function

Authorize access to the Google Search Console API with the authorize_access() function, using pickled credentials when they exist and the OAuth console flow on the first run.


def authorize_access():
    try:
        # Reuse previously cached credentials if they exist
        credentials = pickle.load(open("appengine/credentials.pickle", "rb"))
    except (OSError, IOError) as e:
        # First run: launch the OAuth console flow and cache the credentials
        flow = InstalledAppFlow.from_client_secrets_file(config.CLIENT_SECRET_JSON, scopes=config.OAUTH_SCOPE)
        credentials = flow.run_console()
        pickle.dump(credentials, open("appengine/credentials.pickle", "wb"))

    # Build the Search Console (webmasters) service object
    webmasters_service = build('webmasters', 'v3', credentials=credentials)
    return webmasters_service
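
One optional hardening, not part of the original function: if the pickled credentials have expired but carry a refresh token, they can be refreshed in place instead of re-running the console flow. A minimal sketch, assuming google-auth is installed:

from google.auth.transport.requests import Request

def refresh_if_expired(credentials):
    # Refresh expired credentials in place so the cached pickle keeps working
    if credentials and credentials.expired and credentials.refresh_token:
        credentials.refresh(Request())
    return credentials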

read_site_list() function

Retrieve the list of properties in the account and keep only the verified websites with the read_site_list() function.


def read_site_list(webmasters_service):
    # Retrieve list of properties in account
    site_list = webmasters_service.sites().list().execute()

    # Filter for verified websites
    verified_sites_urls = [s['siteUrl'] for s in site_list['siteEntry']
                           if s['permissionLevel'] != 'siteUnverifiedUser'
                           and s['siteUrl'][:4] == 'http']

    # Save the URLs of all the websites you are verified for.
    site_list = []
    for site_url in verified_sites_urls:
        site_list.append(site_url)
    return site_list


query_data() function

For each site in the list, query_data() queries data by date / page / query / device and combinations of those. For each data set we do the following:


1. Query data from webmasters_service.searchanalytics(), using two parameters:

a) siteUrl

b) body – load the body with the JSON query taken from the config.py file, e.g. config.GSC_QUERY_DATE, which looks like this in config.py (the other query bodies referenced in the code are sketched right after it):


GSC_QUERY_DATE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "date"
    ],
    "rowLimit": 5000
}
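
The other query bodies referenced in query_data() (config.GSC_QUERY_PAGE_DATE, config.GSC_QUERY_QUERY_DATE and so on) are not shown here, but they presumably differ only in their dimensions, for example:

GSC_QUERY_PAGE_DATE = {
    "startDate": START_DATE,
    "endDate": END_DATE,
    "searchType": "web",
    "dimensions": [
        "page",
        "date"
    ],
    "rowLimit": 5000
}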

2. Normalize the JSON response to read all of the fetched rows from the query with the json_normalize() function.


3. Split the nested JSON (e.g. "dimensions": ["query", "date"]) in the 'keys' column from one column into as many columns as needed – in this case 2 – with gsc_data_date['keys'].apply(pd.Series).fillna(0). Otherwise the whole response stays in a single column, which makes the data hard to analyze.


4. Combine those 2 new columns with the rest of the data set using the pd.concat() function (see the toy example after this list).
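
To make steps 2–4 concrete, here is a toy example of the same normalize / split / concat sequence, run on a hypothetical two-row response instead of a live API call:

import pandas as pd
from pandas import json_normalize

# Hypothetical 'rows' list, shaped like a Search Analytics API response
rows = [
    {"keys": ["query a", "2020-01-01"], "clicks": 3, "impressions": 40, "ctr": 0.075, "position": 5.2},
    {"keys": ["query b", "2020-01-02"], "clicks": 1, "impressions": 10, "ctr": 0.1, "position": 8.0},
]

gsc_data = json_normalize(rows)                          # 'keys' is still a list inside a single column
df_keys = gsc_data['keys'].apply(pd.Series).fillna(0)    # one column per dimension: here query and date
gsc_data = pd.concat([gsc_data, df_keys], axis=1)        # append the split columns to the metrics
print(gsc_data.head())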


def query_data(webmasters_service, site_url):
    # Query all data
    gsc_data_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_DATE).execute()
    gsc_data_date = json_normalize(gsc_data_date['rows'])
    df_keys__date = gsc_data_date['keys'].apply(pd.Series).fillna(0)
    gsc_data_date = pd.concat([gsc_data_date, df_keys__date], axis=1)

    gsc_data_page_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_PAGE_DATE).execute()
    gsc_data_page_date = json_normalize(gsc_data_page_date['rows'])
    df_keys_page_date = gsc_data_page_date['keys'].apply(pd.Series).fillna(0)
    gsc_data_page_date = pd.concat([gsc_data_page_date, df_keys_page_date], axis=1)

    gsc_data_query_date = webmasters_service.searchanalytics().query(siteUrl=site_url, body=config.GSC_QUERY_QUERY_DATE).execute()