If you remember from my last blog post, my focus in this project has recently been on writing code to scrape data from the travel websites instead of collecting it manually, which is a tedious job. While working on the code, I have also kept up the manual collection method, since data collection is the objective of this month’s work in my project. So here’s an updated version of my data table:
You might notice that I asterisked (*) one date in each table. For 10/11 in the Expedia table, the particular flight I had been tracking for the June 4th departure disappeared from the search results on that date, so I had to switch to a different flight at a similar time. I doubt the flight sold out, since it is more than 8 months away and is still listed on the Chinese site; it was probably removed by Expedia for reasons unknown to me. The asterisks in the second table mark October 9th, when the RMB-to-USD exchange rate dropped noticeably (from 1 yuan = 0.15 USD to 1 yuan = 0.14 USD), which produced a sudden shift in the price when converted to USD. The exchange rate has stayed at that low point since then, and there haven’t been any large-scale fluctuations.
After that update on the data I have collected, I am going to show you how to scrape data using Python and a powerful library called Beautiful Soup. For this process, I mainly followed a tutorial online. For this example, I am going to scrape a Wikipedia page.
The first step, after opening a text editor window, is to import the libraries.
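The original screenshots are not reproduced here, but the imports probably looked something like this (a sketch assuming Python 3; the tutorial I followed used Python 2’s urllib2, and urllib.request is its modern equivalent):

```python
# Imports used throughout this example. urllib.request is the
# Python 3 replacement for the tutorial's urllib2 module.
from urllib.request import urlopen
from bs4 import BeautifulSoup
```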
Then, I need to tell the program the URL of the website I am scraping.
The following code gives me the HTML of the page I just provided the URL for.
The HTML page must first be parsed for Beautiful Soup to work with it.
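Put together, the fetch-and-parse steps look roughly like this. This is a sketch: the helper name is mine, and because the live call needs a network connection, the runnable line at the bottom parses a small inline snippet instead of the real page:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_and_parse(url):
    """Download a page and hand its HTML to Beautiful Soup."""
    html = urlopen(url).read()                 # raw HTML bytes of the page
    return BeautifulSoup(html, "html.parser")  # parsed tree to query

# Offline stand-in so the sketch runs without a live connection:
soup = BeautifulSoup("<p>hello</p>", "html.parser")
```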
Then, I use the browser’s Inspect tool to find the element I want in the page’s HTML. In this case, I want to find the “interaction” label in the sidebar.
The unique “id” or “class” in the element’s HTML tag helps the program locate the desired element.
In this case, the “id” is “p-interaction-label”, so I use that in the code to locate the HTML I want.
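Locating the tag by its id looks like this. To keep the sketch self-contained, I parse an inline snippet shaped like the Wikipedia sidebar markup (an assumption about its exact structure) instead of the live page:

```python
from bs4 import BeautifulSoup

# Offline stand-in for the sidebar markup (assumed shape):
html = '<h3 id="p-interaction-label"><span>Interaction</span></h3>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag whose id attribute matches.
label = soup.find(id="p-interaction-label")
print(label)  # the full HTML tag, not yet plain text
```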
The next part is to strip the actual text from the HTML code. This is also the part I struggled with the most. In the beginning, this is the code I had.
But it kept returning an error message.
So I deleted the “stripe()” part, ran it again, and it worked.
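Modulo the exact names, the working version likely looked something like this (again using an inline snippet of the assumed sidebar markup so it runs on its own):

```python
from bs4 import BeautifulSoup

html = '<h3 id="p-interaction-label"><span>Interaction</span></h3>'
soup = BeautifulSoup(html, "html.parser")
label = soup.find(id="p-interaction-label")

# Calling label.stripe() raises AttributeError -- Tag objects have no
# such method (it reads like a typo for strip()). get_text() alone
# pulls the plain text out of the tag.
text = label.get_text()
print(text)  # Interaction
```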
Now that I have figured out the basic way to scrape data off of a website, my next step is to advance this code and apply it to the travel websites. On the travel websites, the date for the flight searched is included in the URL.
So when I want data for different dates, I can incorporate variables into the URL and then loop through all the dates I want the program to collect data for.
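That loop could be sketched as follows. The URL pattern here is hypothetical; the real travel-site query string differs, but the flight date sits in the URL the same way:

```python
from datetime import date, timedelta

# Hypothetical URL pattern standing in for the real travel site:
base = "https://www.example-travel.com/flights/PHL-PEK?date={}"

urls = []
start = date(2019, 6, 1)                 # first departure date wanted
for offset in range(3):                  # one URL per date
    d = start + timedelta(days=offset)
    urls.append(base.format(d.strftime("%m/%d/%Y")))

print(urls[0])
# https://www.example-travel.com/flights/PHL-PEK?date=06/01/2019
```

Each generated URL would then be fetched and parsed the same way as the Wikipedia page above.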
My plan moving forward is to collect, each day, data for all the United flights departing PHL and arriving in PEK, so I have more data to work with. This couldn’t be done before, since recording that much data manually every day would take far too long. Next time, with the larger dataset, I am hoping there will be some unexpected discoveries.
Yau, Nathan. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Indianapolis, Wiley Publishing, 2011.
Yek, Justin. “How to Scrape Websites with Python and BeautifulSoup.” FreeCodeCamp, Medium, 10 June 2017, medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe. Accessed 21 Oct. 2018.