This post covers how I found my way through web scraping and explains the importance of building efficient scripts that save time.
For my Senior Project, I am collecting car dealership inventory data to build a Power BI dashboard showing which cars are being sold where and at what price. To get that data, I have to scrape each dealership's website.
Web scraping is a very useful skill for collecting data when an API is not available.
If you are not familiar with how HTML and CSS work, it might be difficult to figure out how to get the specific data that you want. Web scraping is a lot more manageable when you understand exactly what you are looking for instead of relying on documentation to get through everything. Not every piece of data has a Class or an ID attribute in its HTML tag, so you will need to know about the different CSS selectors.
You can use multiple programming languages to web scrape; however, I am mostly familiar with Python and R.
If you are skilled in the R language, there is a well-supported package for web scraping called rvest. I have used it to scrape data from BoardGameGeek, and it is smooth and easy to use. (Then, after hours of creating a script that loops through their website and collects all of their data, I found out that BoardGameGeek's data is publicly available as a SQLite database… but the point is that rvest works great.)
Python is used heavily for web scraping and there are many libraries available to help you get what you need from any website. After reading through many of the different web scraping libraries for Python, I decided to go with Beautiful Soup. Beautiful Soup is an HTML parser and consists of many different functions that make extracting data simple.
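To give a feel for how simple it is, here is a minimal sketch of the requests + Beautiful Soup workflow. The URL and the span.price selector are placeholders, not a real site:

```python
import requests
from bs4 import BeautifulSoup

# Download the page and hand the raw HTML to Beautiful Soup
website = requests.get('https://www.mysite.com/new-inventory/')  # placeholder URL
soup = BeautifulSoup(website.content, 'html.parser')

# CSS selectors pull out exactly the elements you want
for tag in soup.select('span.price'):  # hypothetical class name
    print(tag.get_text(strip=True))
```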
I chose Python to complete this project. Object-oriented programming is very helpful when web scraping, and Python handles it well. For the scope of my project, I want to find any pattern I can that ties the websites together, so that I can build an object that does all the work for me. Unless you want to write a script for every individual website, you will want to follow that advice. (It is possible to create objects in R, but I prefer to use Python.)
Honestly, when I started, I thought that I would only need to learn how to use Beautiful Soup and the rest would be golden. One of the issues I ran into is that Beautiful Soup doesn’t render JavaScript, so any data that is generated by JavaScript will not show up in your Beautiful Soup selectors. There are other libraries you can use to get this done: one of them is Selenium and another is Requests-HTML. (For those of you coding in R, there is a version of Selenium for R.)
When you load a website, all of the HTML and CSS is delivered first, and then the JavaScript runs and generates its content. Parsers like Beautiful Soup open and parse the page without waiting for the JavaScript to render, which is why they cannot be used to gather data that JavaScript generates.
When scraping, I use Beautiful Soup for the majority of the content and only bring in one of the JavaScript-capable tools for the few cases where JavaScript renders the information on the page. Fortunately, on most of the websites I have scraped, the majority of the content is generated with plain HTML and CSS.
I have been experimenting with both Requests-HTML and Selenium. I started with Selenium but backed out because it requires installing a little more than just a library, so I switched to Requests-HTML. Requests-HTML worked for me until I tried running it inside an object: it threw a few errors and told me to use its async functions, and the async approach gave me other errors that I have not yet resolved. If I cannot figure out how to use Requests-HTML's async functions, I will go back to Selenium. Requests-HTML also gave me a couple of problems at times. It would occasionally fail to render a website, which completely broke my code, so I had to tell it to try again whenever it failed. That seems to have resolved the issue.
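Something along these lines is what I mean by telling it to try again. This is a simplified sketch rather than my exact code; the URL and the number of attempts are placeholders:

```python
from requests_html import HTMLSession

def render_page(url, attempts=3):
    """Try to render a JavaScript-heavy page, retrying if the render fails."""
    session = HTMLSession()
    r = session.get(url)
    for attempt in range(attempts):
        try:
            r.html.render()  # runs the page's JavaScript in a headless browser
            return r
        except Exception as e:
            print(f"Render failed (attempt {attempt + 1}): {e}")
    raise RuntimeError(f"Could not render {url} after {attempts} attempts")
```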
Like I said before, it is a great idea to find websites that share a similar HTML and CSS structure, and I found that car dealerships tend to do exactly that. For example, a lot of Ford dealerships have structures similar to each other, and Subaru websites have structures similar to each other. Over the last month, one of the Ford dealerships I was successfully scraping changed their website, so my script broke every time I ran it. After looking through their new structure, I found that it was exactly the same as the Subaru website I was already handling. Creating a class for each type of HTML structure will save you a lot of time.
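To make the "one class per site structure" idea concrete, here is a simplified sketch. The class name and dealership URLs are made up; my real class for one structure (vistadash) is shown further down, and the hproduct/data-vin attributes come from that same structure:

```python
import requests
from bs4 import BeautifulSoup

class FordStyleSite:
    """Scraper for dealerships that share one hypothetical page structure."""

    def __init__(self, url):
        self.url = url

    def scrape(self):
        # Every dealership built on this structure can be scraped the same way
        soup = BeautifulSoup(requests.get(self.url).content, 'html.parser')
        return [car["data-vin"] for car in soup.find_all(attrs={"class": "hproduct"})]

# Reuse the same class for every site that shares the structure
dealers = [FordStyleSite('https://www.ford-dealer-one.com/new-inventory/'),   # placeholder URLs
           FordStyleSite('https://www.ford-dealer-two.com/new-inventory/')]
inventories = [dealer.scrape() for dealer in dealers]
```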
Then there is the problem of having to flip through multiple pages of a website. Libraries like Selenium have functions that allow you to do this, but I got mine to work just fine using Beautiful Soup alone. You will want to look for any indicator in the URL that tells you which page you are on. After clicking on page two, look at the URL; it might contain something like ‘?page=2’ or ‘?start=15’. From there, figure out the pattern the site is using, then create a loop that increments with that pattern and parses the URL with the new value. Here is an example in pseudocode:
```python
page = 1
last_page = 10   # you need some way to work this out from the site (see below)

while page <= last_page:
    url = f'https://www.mysite.com?page={page}'
    r = render(url)              # fetch and parse the page
    r.select('.WebsiteClass')    # pull out the data you need
    page += 1
```
This code dynamically sets the page number in the URL and then parses that page. It grabs a value on the first page, increments the page number, and loops the process until it reaches the last page. You will need some way for it to know how many pages the website has. In my case, I scrape the number of vehicles in the inventory, and luckily the URL gives me the number of vehicles per page rather than the page number, so I can work out how many pages there are by dividing the inventory count by the number of vehicles on each page.
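As a quick illustration of that math (the numbers here are made up):

```python
import math

inventory_count = 37   # scraped from the page (hypothetical value)
cars_per_page = 15     # the '?start=' step the site uses (hypothetical value)

pages = math.ceil(inventory_count / cars_per_page)
print(pages)  # 3 pages: two full pages of 15 and a final page of 7
```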
There are plenty of open-source materials to get you on your way in web scraping, and the Stack Overflow community is ready to answer your questions when you can’t find the answer online. Keep in mind that people do change their websites. I advise you to make your code as debug friendly as possible, because it may work today but break tomorrow.
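One simple way to keep things debug friendly is to wrap your selectors so that a changed layout produces a readable warning instead of a crash. This is just an illustrative helper, not something from my project:

```python
def safe_select_text(soup, selector, default=None):
    """Return the text of the first element matching the CSS selector,
    or a default value (with a warning) if the site layout has changed."""
    matches = soup.select(selector)
    if not matches:
        print(f"WARNING: no match for selector '{selector}' -- did the site change?")
        return default
    return matches[0].get_text(strip=True)
```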
Here is a class that I created to get data from dealerships:
```python
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# "f" refers to my module of helper functions;
# get_page_limit and get_lat_lon are shown at the end of the post.


class vistadash:

    def __init__(self, urlNew, urlUsed, limit, dealership, Date, urlNewAddOn="", urlUsedAddOn=""):
        """
        @param urlNew: The url that requests all of the new inventory
        @param urlUsed: The url that requests all of the used inventory
        @param limit: The number of cars listed per page
        @param dealership: The name of the dealership
        @param Date: The timestamp that this was run
        --------------------------------------------------------------
        @param urlNewAddOn: Any extension to the new url
        @param urlUsedAddOn: Any extension to the used url

        How the url should be inserted: 'https://www.dealership.com/new-inventory/index.htm'
        The code will automatically add '?start=0&' to the url for looping purposes.
        Any part of the url that comes after
        'https://www.dealership.com/new-inventory/index.htm?start=0&'
        should be passed in through urlNewAddOn or urlUsedAddOn.
        --------------------------------------------------------------
        This class creates a web scraping object that is specialized in
        scraping car dealership websites that are based on VistaDash.
        """
        self.urlNew = urlNew
        self.urlUsed = urlUsed
        self.limit = limit
        self.dealership = dealership
        self.Date = Date
        self.urlNewAddOn = urlNewAddOn
        self.urlUsedAddOn = urlUsedAddOn

    def scrape(self):
        """
        Scrapes the website for all of the inventory data and returns it as a DataFrame.
        """
        # Already set columns
        dealership = self.dealership
        Date = self.Date
        print(dealership)

        # Data set headers
        dataset = np.array(['make', 'model', 'trim', 'body', 'vin', 'year', 'condition',
                            'int_color', 'ext_color', 'engine', 'tran', 'drive', 'fuel', 'miles',
                            'price', 'msrp', 'state', 'city', 'latitude', 'longitude',
                            'dealership', 'date'])

        # parse website
        print("Parsing HTML...")
        limit = self.limit
        extension = 0

        if len(self.urlNewAddOn) > 0:
            url = f'{self.urlNew}?start={extension}&{self.urlNewAddOn}'
        else:
            url = f'{self.urlNew}?start={extension}&'

        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')
        print("Successfully parsed HTML!")

        # get address from website
        street = soup.select('span[class*=street-address]')[0].string
        city = soup.select('span[class*=locality]')[0].string
        state = soup.select('span[class*=region]')[0].string
        zip_code = soup.select('span[class*=postal-code]')[0].string

        latitude, longitude = f.get_lat_lon(state, city, zip_code, street)

        ### NEW INVENTORY ###
        print("NEW INVENTORY")

        # get inventory count
        inventory_count = int(soup.select('span[class*=vehicle-count]')[0].string)

        # Get the page count from the inventory count and the per-page limit
        inv_range, extra, out_of_limit = f.get_page_limit(inventory_count, limit)

        print("Starting web scrape for new inventory data...")
        index_count = 0

        for x in range(inv_range):
            print("Parsing HTML...")
            website = requests.get(url)
            soup = BeautifulSoup(website.content, 'html.parser')
            print("Successfully parsed HTML!")

            cars = soup.find_all(attrs={"class": "hproduct"})  # get data from attributes

            count = 1
            page_count = int((extension / limit) + 1)  # counts which page the scrape is on
            print(f"Page: {page_count} of {inv_range}")

            # the last page may hold fewer cars than the limit
            if page_count == inv_range and extra:
                limit = out_of_limit

            for i in range(limit):
                index_count += 1
                make = cars[i]["data-make"]
                model = cars[i]["data-model"]
                trim = cars[i]["data-trim"]
                body = cars[i]["data-bodystyle"]
                vin = cars[i]["data-vin"]
                year = int(cars[i]["data-year"])
                condition = cars[i]["data-type"]

                if soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[0].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[0].text.strip(",")
                elif soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[1].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[1].text.strip(",")
                else:
                    int_color = None

                ext_color = cars[i]["data-exteriorcolor"]
                engine = soup.select(f'div[data-index-position*="{count}"] dd')[0].text.strip(",")
                tran = soup.select(f'div[data-index-position*="{count}"] dd')[1].text.strip(",")
                drive = None
                fuel = None
                miles = None

                price = soup.select(f'div[data-index-position*="{count}"] span.final-price span.value')[0].string
                if price == 'Please Call':
                    price = None
                else:
                    price = int(price.strip('$').replace(',', ''))

                try:
                    msrp = soup.select(f'div[data-index-position*="{count}"] span.retailValue span.value')[0].string
                    if msrp == 'Please Call':
                        msrp = None
                    else:
                        msrp = int(msrp.strip('$').replace(',', ''))
                except:  # fall back to the 'msrp' span when 'retailValue' is not present
                    msrp = int(soup.select(f'div[data-index-position*="{count}"] span.msrp span.value')[0].string.strip('$').replace(',', ''))

                car = [make, model, trim, body, vin, year, condition,
                       int_color, ext_color, engine, tran, drive,
                       fuel, miles, price, msrp, state,
                       city, latitude, longitude, dealership, Date]
                dataset = np.vstack([dataset, car])

                print(index_count)
                count += 1

            # move the '?start=' offset forward and rebuild the url for the next page
            extension += limit
            if len(self.urlNewAddOn) > 0:
                url = f'{self.urlNew}?start={extension}&{self.urlNewAddOn}'
            else:
                url = f'{self.urlNew}?start={extension}&'

        ### USED INVENTORY ###
        print("USED INVENTORY")

        # parse website
        limit = self.limit
        extension = 0

        if len(self.urlUsedAddOn) > 0:
            url = f'{self.urlUsed}?start={extension}&{self.urlUsedAddOn}'
        else:
            url = f'{self.urlUsed}?start={extension}&'

        print("Parsing HTML...")
        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')
        print("Successfully parsed HTML!")

        # get inventory count
        inventory_count = int(soup.select('span[class*=vehicle-count]')[0].string)

        # Get the page count from the inventory count and the per-page limit
        inv_range, extra, out_of_limit = f.get_page_limit(inventory_count, limit)

        print("Starting web scrape for used inventory data...")
        index_count = 0

        for x in range(inv_range):
            print("Parsing HTML...")
            website = requests.get(url)
            soup = BeautifulSoup(website.content, 'html.parser')
            print("Successfully parsed HTML!")

            cars = soup.find_all(attrs={"class": "hproduct"})  # get data from attributes

            count = 1
            page_count = int((extension / limit) + 1)  # counts which page the scrape is on
            print(f"Page: {page_count} of {inv_range}")

            # the last page may hold fewer cars than the limit
            if page_count == inv_range and extra:
                limit = out_of_limit

            for i in range(limit):
                index_count += 1
                make = cars[i]["data-make"]
                model = cars[i]["data-model"]
                trim = cars[i]["data-trim"]
                body = cars[i]["data-bodystyle"]
                vin = cars[i]["data-vin"]
                year = int(cars[i]["data-year"])
                condition = cars[i]["data-type"]

                if soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[0].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[0].text.strip(",")
                elif soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dt')[1].string == "Interior Color:":
                    int_color = soup.select(f'div[data-index-position*="{count}"] dl[class*=last] dd')[1].text.strip(",")
                else:
                    int_color = None

                if cars[i]["data-exteriorcolor"] == cars[i]["data-bodystyle"]:
                    ext_color = None
                else:
                    ext_color = cars[i]["data-exteriorcolor"]

                engine = soup.select(f'div[data-index-position*="{count}"] dd')[0].text.strip(",")
                tran = soup.select(f'div[data-index-position*="{count}"] dd')[1].text.strip(",")
                drive = None
                fuel = None
                miles = int(soup.select(f'div[data-index-position*="{count}"] dd')[2].text.replace(' miles', '').replace(',', ''))

                price = soup.select(f'div[data-index-position*="{count}"] span.final-price span.value')[0].string
                if price == 'Please Call':
                    price = None
                else:
                    price = int(price.strip('$').replace(',', ''))

                msrp = None

                car = [make, model, trim, body, vin, year, condition,
                       int_color, ext_color, engine, tran, drive,
                       fuel, miles, price, msrp, state,
                       city, latitude, longitude, dealership, Date]
                dataset = np.vstack([dataset, car])

                print(index_count)
                count += 1

            # move the '?start=' offset forward and rebuild the url for the next page
            extension += limit
            if len(self.urlUsedAddOn) > 0:
                url = f'{self.urlUsed}?start={extension}&{self.urlUsedAddOn}'
            else:
                url = f'{self.urlUsed}?start={extension}&'

        # convert to pandas
        df = pd.DataFrame(dataset)

        # make the first row the header
        df.columns = df.iloc[0]
        df = df[1:]

        print("Complete!")
        return df
```
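Using the class looks roughly like this; the URLs, dealership name, and per-page limit below are placeholders:

```python
from datetime import datetime

scraper = vistadash(urlNew='https://www.dealership.com/new-inventory/index.htm',    # placeholder
                    urlUsed='https://www.dealership.com/used-inventory/index.htm',  # placeholder
                    limit=15,                                   # cars listed per page
                    dealership='Example Ford',                  # placeholder name
                    Date=datetime.now().strftime('%Y-%m-%d'))

df = scraper.scrape()   # returns a pandas DataFrame of the full inventory
```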
There are a couple of functions that I use in this class to get the number of pages on each website and the location of each store:
```python
# Calculates the number of pages on the website
def get_page_limit(inv_count, limit):
    """
    @param inv_count: [INT] Number of vehicles in the inventory
    @param limit: [INT] The number of cars listed on each page

    Returns three values:
    inv_range: the number of pages on the website
    extra: True if the last page doesn't have the full limit of cars listed
    out_of_limit: the remainder of inv_count / limit
    """
    extra = False
    out_of_limit = int(inv_count % limit)  # cars left over on the last page
    inv_range = 0
    if out_of_limit > 0:
        inv_range = int((inv_count / limit) + 1)  # inventory is not evenly divisible by the limit
        extra = True
    else:
        inv_range = int(inv_count / limit)        # inventory is evenly divisible by the limit
    return inv_range, extra, out_of_limit
```
```python
import os

from dotenv import load_dotenv
from geopy.geocoders import Bing


def get_lat_lon(state, city, zip_code, street):
    """
    @param state: The state in the USA
    @param city: The city in the provided state
    @param zip_code: The zip code for the desired location
    @param street: The street address in the provided zip code

    Uses the state, city, zip code, and street address to retrieve the
    latitude and longitude. Returns latitude first and then longitude,
    so capture the result in two variables.
    """
    # get variables from the .env file
    load_dotenv()
    # set the API key from the .env variables
    APIKEY = os.getenv('APIKEY')

    print("Requesting latitude and longitude...")
    locator = Bing(APIKEY)
    location = locator.geocode(f"{state}, {city}, {zip_code}, {street}")
    latitude = location.latitude
    longitude = location.longitude
    print("Successfully retrieved latitude and longitude!")
    return latitude, longitude
```
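Calling it looks like this; the address below is made up, and you need a Bing Maps key stored as APIKEY in your .env file:

```python
latitude, longitude = get_lat_lon('UT', 'Salt Lake City', '84101', '123 S Main St')  # hypothetical address
print(latitude, longitude)
```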
For attribution, please cite this work as
Sant (2021, March 20). Data Science with Keaton: Efficient Web Scraping. Retrieved from https://keatonjsant.github.io/posts/2021-03-08-sppart2/
BibTeX citation
@misc{sant2021efficient,
  author = {Sant, Keaton},
  title = {Data Science with Keaton: Efficient Web Scraping},
  url = {https://keatonjsant.github.io/posts/2021-03-08-sppart2/},
  year = {2021}
}