This post covers how I found my way through web scraping and explains the importance of building efficient scripts that save time.
For my Senior Project, I am collecting car dealership inventory data to build a Power BI dashboard showing which cars are being sold where and at what price. To get that data, I have to scrape each dealership's website.
Web scraping is a very useful skill for collecting data when an API is not available.
If you are not familiar with how HTML and CSS work, it might be difficult to figure out how to get the specific data that you want. Web scraping is a lot more manageable when you understand exactly what you are looking for instead of relying on documentation to get through everything. Not every piece of data has a Class or an ID attribute in its HTML tag, so you will need to know about the different CSS selectors.
You can use multiple programming languages to web scrape; however, I am mostly familiar with Python and R.
If you are skilled in the R language, there is a well-supported package for web scraping called rvest. I have used it to scrape data from BoardGameGeek, and it is smooth and easy to use. (Then, after hours of creating a script that loops through their website and collects all of their data, I found out that BoardGameGeek's data is publicly available as a SQLite database… but the point is that rvest works great.)
Python is used heavily for web scraping and there are many libraries available to help you get what you need from any website. After reading through many of the different web scraping libraries for Python, I decided to go with Beautiful Soup. Beautiful Soup is an HTML parser and consists of many different functions that make extracting data simple.
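To give a feel for how simple it is, here is a minimal sketch of the requests + Beautiful Soup workflow. The URL and the span.price selector are placeholders, not a real site:

```python
import requests
from bs4 import BeautifulSoup

# Download the page and hand the raw HTML to Beautiful Soup
website = requests.get('https://www.mysite.com/new-inventory/')  # placeholder URL
soup = BeautifulSoup(website.content, 'html.parser')

# CSS selectors pull out exactly the elements you want
for tag in soup.select('span.price'):  # hypothetical class name
    print(tag.get_text(strip=True))
```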
I chose Python to complete this project. Object-oriented programming is very helpful when web scraping, and Python handles it well. For the scope of my project, I want to find any pattern I can that ties the websites together, so that I can build an object that does all the work for me. Unless you want to write a script for every individual website, you will want to follow that advice. (It is possible to create objects in R, but I prefer to use Python.)
Honestly, when I started, I thought that I would only need to learn how to use Beautiful Soup and the rest would be golden. One of the issues I ran into is that Beautiful Soup doesn’t render JavaScript, so any data that is generated by JavaScript will not show up in your Beautiful Soup selectors. There are other libraries you can use to get this done: one of them is Selenium and another is Requests-HTML. (For those of you coding in R, there is a version of Selenium for R.)
When you load a website, all of the HTML and CSS is delivered first, and then the JavaScript runs and generates its content. Parsers like Beautiful Soup open and parse the page without waiting for the JavaScript to render, which is why they cannot be used to gather data that JavaScript generates.
When scraping, I use Beautiful Soup for the majority of the content and only bring in one of the JavaScript-capable tools for the few cases where JavaScript renders the information on the page. Fortunately, on most of the websites I have scraped, the majority of the content is generated with plain HTML and CSS.
I have been experimenting with both Requests-HTML and Selenium. I started with Selenium but backed out because it requires installing a little more than just a library, so I switched to Requests-HTML. Requests-HTML worked for me until I tried running it inside an object: it threw a few errors and told me to use its async functions, and the async approach gave me other errors that I have not yet resolved. If I cannot figure out how to use Requests-HTML's async functions, I will go back to Selenium. Requests-HTML also gave me a couple of problems at times. It would occasionally fail to render a website, which completely broke my code, so I had to tell it to try again whenever it failed. That seems to have resolved the issue.
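Something along these lines is what I mean by telling it to try again. This is a simplified sketch rather than my exact code; the URL and the number of attempts are placeholders:

```python
from requests_html import HTMLSession

def render_page(url, attempts=3):
    """Try to render a JavaScript-heavy page, retrying if the render fails."""
    session = HTMLSession()
    r = session.get(url)
    for attempt in range(attempts):
        try:
            r.html.render()  # runs the page's JavaScript in a headless browser
            return r
        except Exception as e:
            print(f"Render failed (attempt {attempt + 1}): {e}")
    raise RuntimeError(f"Could not render {url} after {attempts} attempts")
```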
Like I said before, it is a great idea to find websites that share a similar HTML and CSS structure, and I found that car dealerships tend to do exactly that. For example, a lot of Ford dealerships have structures similar to each other, and Subaru websites have structures similar to each other. Over the last month, one of the Ford dealerships I was successfully scraping changed their website, so my script broke every time I ran it. After looking through their new structure, I found that it was exactly the same as the Subaru website I was already handling. Creating a class for each type of HTML structure will save you a lot of time.
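To make the "one class per site structure" idea concrete, here is a simplified sketch. The class name and dealership URLs are made up; my real class for one structure (vistadash) is shown further down, and the hproduct/data-vin attributes come from that same structure:

```python
import requests
from bs4 import BeautifulSoup

class FordStyleSite:
    """Scraper for dealerships that share one hypothetical page structure."""

    def __init__(self, url):
        self.url = url

    def scrape(self):
        # Every dealership built on this structure can be scraped the same way
        soup = BeautifulSoup(requests.get(self.url).content, 'html.parser')
        return [car["data-vin"] for car in soup.find_all(attrs={"class": "hproduct"})]

# Reuse the same class for every site that shares the structure
dealers = [FordStyleSite('https://www.ford-dealer-one.com/new-inventory/'),   # placeholder URLs
           FordStyleSite('https://www.ford-dealer-two.com/new-inventory/')]
inventories = [dealer.scrape() for dealer in dealers]
```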
Then there is the problem of having to flip through multiple pages of a website. Libraries like Selenium have functions that allow you to do this, but I got mine to work just fine using Beautiful Soup alone. You will want to look for any indicator in the URL that tells you which page you are on. After clicking on page two, look at the URL; it might contain something like ‘?page=2’ or ‘?start=15’. From there, figure out the pattern the site is using, then create a loop that increments with that pattern and parses the URL with the new value. Here is an example in pseudocode:
```python
page = 1
last_page = 10   # you need some way to work this out from the site (see below)

while page <= last_page:
    url = f'https://www.mysite.com?page={page}'
    r = render(url)              # fetch and parse the page
    r.select('.WebsiteClass')    # pull out the data you need
    page += 1
```
This code dynamically sets the page number in the URL and then parses that page. It grabs a value on the first page, increments the page number, and loops the process until it reaches the last page. You will need some way for it to know how many pages the website has. In my case, I scrape the number of vehicles in the inventory, and luckily the URL gives me the number of vehicles per page rather than the page number, so I can work out how many pages there are by dividing the inventory count by the number of vehicles on each page.
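As a quick illustration of that math (the numbers here are made up):

```python
import math

inventory_count = 37   # scraped from the page (hypothetical value)
cars_per_page = 15     # the '?start=' step the site uses (hypothetical value)

pages = math.ceil(inventory_count / cars_per_page)
print(pages)  # 3 pages: two full pages of 15 and a final page of 7
```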
There are plenty of open-source materials to get you on your way in web scraping, and the Stack Overflow community is ready to answer your questions when you can’t find the answer online. Keep in mind that people do change their websites. I advise you to make your code as debug friendly as possible, because it may work today but break tomorrow.
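One simple way to keep things debug friendly is to wrap your selectors so that a changed layout produces a readable warning instead of a crash. This is just an illustrative helper, not something from my project:

```python
def safe_select_text(soup, selector, default=None):
    """Return the text of the first element matching the CSS selector,
    or a default value (with a warning) if the site layout has changed."""
    matches = soup.select(selector)
    if not matches:
        print(f"WARNING: no match for selector '{selector}' -- did the site change?")
        return default
    return matches[0].get_text(strip=True)
```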
Here is a class that I created to get data from dealerships:
```python
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# "f" refers to my module of helper functions;
# get_page_limit and get_lat_lon are shown at the end of the post.


class vistadash:

    def __init__(self, urlNew, urlUsed, limit, dealership, Date, urlNewAddOn="", urlUsedAddOn=""):
        """
        @param urlNew: The url that requests all of the new inventory
        @param urlUsed: The url that requests all of the used inventory
        @param limit: The number of cars listed per page
        @param dealership: The name of the dealership
        @param Date: The timestamp that this was run
        --------------------------------------------------------------
        @param urlNewAddOn: Any extension to the new url
        @param urlUsedAddOn: Any extension to the used url

        How the url should be inserted: 'https://www.dealership.com/new-inventory/index.htm'
        The code will automatically add '?start=0&' to the url for looping purposes.
        Any part of the url that comes after
        'https://www.dealership.com/new-inventory/index.htm?start=0&'
        should be passed in through urlNewAddOn or urlUsedAddOn.
        --------------------------------------------------------------
        This class creates a web scraping object that is specialized in
        scraping car dealership websites that are based on VistaDash.
        """
        self.urlNew = urlNew
        self.urlUsed = urlUsed
        self.limit = limit
        self.dealership = dealership
        self.Date = Date
        self.urlNewAddOn = urlNewAddOn
        self.urlUsedAddOn = urlUsedAddOn

    def scrape(self):
        """
        Scrapes the website for all of the inventory data and returns it as a DataFrame.
        """
        # Already set columns
        dealership = self.dealership
        Date = self.Date
        print(dealership)

        # Data set headers
        dataset = np.array(['make', 'model', 'trim', 'body', 'vin', 'year', 'condition',
                            'int_color', 'ext_color', 'engine', 'tran', 'drive', 'fuel', 'miles',
                            'price', 'msrp', 'state', 'city', 'latitude', 'longitude',
                            'dealership', 'date'])

        # parse website
        print("Parsing HTML...")
        limit = self.limit
        extension = 0

        if len(self.urlNewAddOn) > 0:
            url = f'{self.urlNew}?start={extension}&{self.urlNewAddOn}'
        else:
            url = f'{self.urlNew}?start={extension}&'

        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')
        print("Successfully parsed HTML!")

        # get address from website
        street = soup.select('span[class*=street-address]')[0].string
        city = soup.select('span[class*=locality]')[0].string
        state = soup.select('span[class*=region]')[0].string
        zip_code = soup.select('span[class*=postal-code]')[0].string

        latitude, longitude = f.get_lat_lon(state, city, zip_code, street)

        ### NEW INVENTORY ###
        print("NEW INVENTORY")

        # get inventory count
        inventory_count = int(soup.select('span[class*=vehicle-count]')[0].string)

        # Get the page count from the inventory count and the per-page limit
        inv_range, extra, out_of_limit = f.get_page_limit(inventory_count, limit)

        print("Starting web scrape for new inventory data...")
        index_count = 0

        for x in range(inv_range):
            print("Parsing HTML...")
            website = requests.get(url)
            soup = BeautifulSoup(website.content, 'html.parser')
            print("Successfully parsed HTML!")

            cars = soup.find_all(attrs={"class": "hproduct"})  # get data from attributes

            count = 1
            page_count = int((extension / limit) + 1)  # counts which page the scrape is on
            print(f"Page: {page_count} of {inv_range}")

            # the last page may hold fewer cars than the limit
            if page_count == inv_range and extra:
                limit = out_of_limit

            for i in range(limit):
                index_count += 1
                make = cars[i]["data-make"]
                model = cars[i]["data-model"]
                trim = cars[i]["data-trim"]
                body = cars[i]["data-bodystyle"]
                vin = cars[i]["data-vin"]
                year = int(cars[i]["data-year"])
                condition = cars[i]["data-type"]

                if soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[0].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[0].text.strip(",")
                elif soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[1].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[1].text.strip(",")
                else:
                    int_color = None

                ext_color = cars[i]["data-exteriorcolor"]
                engine = soup.select(f'div[data-index-position*="{count}"] dd')[0].text.strip(",")
                tran = soup.select(f'div[data-index-position*="{count}"] dd')[1].text.strip(",")
                drive = None
                fuel = None
                miles = None

                price = soup.select(f'div[data-index-position*="{count}"] span.final-price span.value')[0].string
                if price == 'Please Call':
                    price = None
                else:
                    price = int(price.strip('$').replace(',', ''))

                try:
                    msrp = soup.select(f'div[data-index-position*="{count}"] span.retailValue span.value')[0].string
                    if msrp == 'Please Call':
                        msrp = None
                    else:
                        msrp = int(msrp.strip('$').replace(',', ''))
                except:  # fall back to the 'msrp' span when 'retailValue' is not present
                    msrp = int(soup.select(f'div[data-index-position*="{count}"] span.msrp span.value')[0].string.strip('$').replace(',', ''))

                car = [make, model, trim, body, vin, year, condition,
                       int_color, ext_color, engine, tran, drive,
                       fuel, miles, price, msrp, state,
                       city, latitude, longitude, dealership, Date]
                dataset = np.vstack([dataset, car])

                print(index_count)
                count += 1

            # move the '?start=' offset forward and rebuild the url for the next page
            extension += limit
            if len(self.urlNewAddOn) > 0:
                url = f'{self.urlNew}?start={extension}&{self.urlNewAddOn}'
            else:
                url = f'{self.urlNew}?start={extension}&'

        ### USED INVENTORY ###
        print("USED INVENTORY")

        # parse website
        limit = self.limit
        extension = 0

        if len(self.urlUsedAddOn) > 0:
            url = f'{self.urlUsed}?start={extension}&{self.urlUsedAddOn}'
        else:
            url = f'{self.urlUsed}?start={extension}&'

        print("Parsing HTML...")
        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')
        print("Successfully parsed HTML!")

        # get inventory count
        inventory_count = int(soup.select('span[class*=vehicle-count]')[0].string)

        # Get the page count from the inventory count and the per-page limit
        inv_range, extra, out_of_limit = f.get_page_limit(inventory_count, limit)

        print("Starting web scrape for used inventory data...")
        index_count = 0

        for x in range(inv_range):
            print("Parsing HTML...")
            website = requests.get(url)
            soup = BeautifulSoup(website.content, 'html.parser')
            print("Successfully parsed HTML!")

            cars = soup.find_all(attrs={"class": "hproduct"})  # get data from attributes

            count = 1
            page_count = int((extension / limit) + 1)  # counts which page the scrape is on
            print(f"Page: {page_count} of {inv_range}")

            # the last page may hold fewer cars than the limit
            if page_count == inv_range and extra:
                limit = out_of_limit

            for i in range(limit):
                index_count += 1
                make = cars[i]["data-make"]
                model = cars[i]["data-model"]
                trim = cars[i]["data-trim"]
                body = cars[i]["data-bodystyle"]
                vin = cars[i]["data-vin"]
                year = int(cars[i]["data-year"])
                condition = cars[i]["data-type"]

                if soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[0].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[0].text.strip(",")
                elif soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[1].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[1].text.strip(",")
                else:
                    int_color = None

                if cars[i]["data-exteriorcolor"] == cars[i]["data-bodystyle"]:
                    ext_color = None
                else:
                    ext_color = cars[i]["data-exteriorcolor"]

                engine = soup.select(f'div[data-index-position*="{count}"] dd')[0].text.strip(",")
                tran = soup.select(f'div[data-index-position*="{count}"] dd')[1].text.strip(",")
                drive = None
                fuel = None
                miles = int(soup.select(f'div[data-index-position*="{count}"] dd')[2].text.replace(' miles', '').replace(',', ''))

                price = soup.select(f'div[data-index-position*="{count}"] span.final-price span.value')[0].string
                if price == 'Please Call':
                    price = None
                else:
                    price = int(price.strip('$').replace(',', ''))

                msrp = None

                car = [make, model, trim, body, vin, year, condition,
                       int_color, ext_color, engine, tran, drive,
                       fuel, miles, price, msrp, state,
                       city, latitude, longitude, dealership, Date]
                dataset = np.vstack([dataset, car])

                print(index_count)
                count += 1

            # move the '?start=' offset forward and rebuild the url for the next page
            extension += limit
            if len(self.urlUsedAddOn) > 0:
                url = f'{self.urlUsed}?start={extension}&{self.urlUsedAddOn}'
            else:
                url = f'{self.urlUsed}?start={extension}&'

        # convert to pandas
        df = pd.DataFrame(dataset)

        # make the first row the header
        df.columns = df.iloc[0]
        df = df[1:]

        print("Complete!")
        return df
```
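Using the class looks roughly like this; the URLs, dealership name, and per-page limit below are placeholders:

```python
from datetime import datetime

scraper = vistadash(urlNew='https://www.dealership.com/new-inventory/index.htm',    # placeholder
                    urlUsed='https://www.dealership.com/used-inventory/index.htm',  # placeholder
                    limit=15,                                   # cars listed per page
                    dealership='Example Ford',                  # placeholder name
                    Date=datetime.now().strftime('%Y-%m-%d'))

df = scraper.scrape()   # returns a pandas DataFrame of the full inventory
```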
There are a couple of functions that I use in this class to get the number of pages on each website and the location of each store:
```python
# Calculates the number of pages on the website
def get_page_limit(inv_count, limit):
    """
    @param inv_count: [INT] Number of vehicles in the inventory
    @param limit: [INT] The number of cars listed on each page

    Returns three values:
    inv_range: the number of pages on the website
    extra: True if the last page doesn't have the full limit of cars listed
    out_of_limit: the remainder of inv_count / limit
    """
    extra = False
    out_of_limit = int(inv_count % limit)  # cars left over on the last page
    inv_range = 0
    if out_of_limit > 0:
        inv_range = int((inv_count / limit) + 1)  # inventory is not evenly divisible by the limit
        extra = True
    else:
        inv_range = int(inv_count / limit)        # inventory is evenly divisible by the limit
    return inv_range, extra, out_of_limit
```
```python
import os

from dotenv import load_dotenv
from geopy.geocoders import Bing


def get_lat_lon(state, city, zip_code, street):
    """
    @param state: The state in the USA
    @param city: The city in the provided state
    @param zip_code: The zip code for the desired location
    @param street: The street address in the provided zip code

    Uses the state, city, zip code, and street address to retrieve the
    latitude and longitude. Returns latitude first and then longitude,
    so capture the result in two variables.
    """
    # get variables from the .env file
    load_dotenv()
    # set the API key from the .env variables
    APIKEY = os.getenv('APIKEY')

    print("Requesting latitude and longitude...")
    locator = Bing(APIKEY)
    location = locator.geocode(f"{state}, {city}, {zip_code}, {street}")
    latitude = location.latitude
    longitude = location.longitude
    print("Successfully retrieved latitude and longitude!")
    return latitude, longitude
```
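Calling it looks like this; the address below is made up, and you need a Bing Maps key stored as APIKEY in your .env file:

```python
latitude, longitude = get_lat_lon('UT', 'Salt Lake City', '84101', '123 S Main St')  # hypothetical address
print(latitude, longitude)
```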
For attribution, please cite this work as
Sant (2021, March 20). Data Science with Keaton: Efficient Web Scraping. Retrieved from https://keatonjsant.github.io/posts/2021-03-08-sppart2/
BibTeX citation
@misc{sant2021efficient,
  author = {Sant, Keaton},
  title = {Data Science with Keaton: Efficient Web Scraping},
  url = {https://keatonjsant.github.io/posts/2021-03-08-sppart2/},
  year = {2021}
}