Data Collection Continued – Alina

Updated Table

If you remember from my last blog post, my recent focus in this project has been on figuring out a way to scrape data from the travel websites with code instead of collecting it manually, since doing it by hand is a tedious job. Of course, while working on the code, I have also kept up the manual collection, since data collection is the objective of this month's work. So here's an updated version of my data tables:

[Screenshots of the two updated data tables: the Expedia table and the Chinese-site table]

You might notice that I asterisked (*) one date in each table. For 10/11 in the Expedia table, the reason is that starting on that date, the particular flight I had been tracking for the June 4th departure disappeared from the search results, so I had to switch to a different flight at a similar time. I doubt the flight disappeared because its tickets sold out, since the departure is more than eight months away and the flight is still available on the Chinese site; I think it was probably a choice by Expedia, for reasons unknown to me. The asterisks in the second table are there because the RMB-to-USD exchange rate dropped noticeably on October 9th (from 1 yuan = 0.15 USD to 1 yuan = 0.14 USD), which produced the sudden rise in the price when converted to USD. The exchange rate has stayed at that lower level since then, and there haven't been any large-scale fluctuations.


Data Scraping

After that update on the data I have collected, I am going to show you how to scrape data using Python and a powerful library called Beautiful Soup. Throughout this process, I mainly followed an online tutorial (see Sources). For this example, I am going to scrape a Wikipedia page.

The first step, after opening a text editor window, is to import the libraries.

[Screenshot of the import statements]
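For readers who want to follow along, here is roughly what that step looks like. I am sketching it with Python 3's urllib; the tutorial I followed uses Python 2's urllib2, so the exact import may differ.

    # the two libraries needed: urllib fetches the page, Beautiful Soup parses it
    from urllib.request import urlopen
    from bs4 import BeautifulSoup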

Then, I need to tell the program what the URL of the website I am scraping is.

[Screenshot of the line that stores the URL]
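Something along these lines (the exact article doesn't matter for this example, so the URL below is just a placeholder):

    # any Wikipedia page with the standard sidebar works for this example
    url = "https://en.wikipedia.org/wiki/Main_Page"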

The following code gives me the HTML of the page I just provided the URL for.

[Screenshot of the code that fetches the page's HTML]
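Roughly, fetching the page is a single call:

    # download the raw HTML of the page at that URL
    html = urlopen(url).read()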

The HTML page must first be parsed for Beautiful Soup to work with it.

[Screenshot of the parsing step]
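Roughly, this step hands the HTML to Beautiful Soup along with the name of a parser:

    # parse the HTML with Python's built-in parser so Beautiful Soup can search it
    soup = BeautifulSoup(html, "html.parser")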

Then, I use the browser's inspect tool to find the element I want in the page's HTML. In this case, I want to find the “interaction” label in the sidebar.

[Screenshot of the browser's inspect panel showing the sidebar element]

The unique “id” or “class” in the element's HTML tag is what helps the program find the desired element.

[Screenshot of the HTML tag showing its id attribute]

In this case, the “id” is “p-interaction-label”, so we use that in our code to locate the HTML code we want.

[Screenshot of the code that locates the element by its id]
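In code, that lookup is roughly one line:

    # find the tag whose id is "p-interaction-label" (the sidebar's interaction heading)
    interaction_label = soup.find(id="p-interaction-label")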

The next part is to strip the actual text from the HTML code. This is also the part I struggled with the most. In the beginning, this is the code I had.

[Screenshot of my first attempt at extracting the text]

But it kept returning an error message.

[Screenshot of the error message]

So I deleted the “stripe()” part and ran it again, and it worked. My guess is that the error came from that typo: there is no stripe() method on strings (the real string method is strip(), with no “e”).

[Screenshot of the working code and its output]
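For reference, the working version looks roughly like this:

    # pull just the text out of the tag; for this element it is the sidebar label "Interaction"
    label_text = interaction_label.text
    print(label_text)

Adding .strip() (spelled without the “e”) after .text would also work, and it trims any surrounding whitespace from the result.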

Now that I have figured out the basic way to scrape data from a website, my next step is to build on this code and apply it to the travel websites. On the travel websites, the date of the flight being searched is included right in the URL.

[Screenshot of a flight-search URL with the departure date embedded in it]

So when I want to get data for different dates, all I have to do is build the date into the URL as a variable and set up a loop that goes through all the dates I want the program to collect, as sketched below.
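Here is a rough sketch of that idea. The site name and query parameters below are placeholders, not the real travel site's URL pattern, but they show how a loop can drop each date into the URL:

    from datetime import date, timedelta
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    start = date(2019, 6, 4)           # the June 4th departure I have been tracking
    for offset in range(7):            # for example, a week of departure dates
        depart = (start + timedelta(days=offset)).isoformat()
        # placeholder URL; the real site's parameter names are different
        url = "https://travel-site.example.com/flights?from=PHL&to=PEK&date=" + depart
        soup = BeautifulSoup(urlopen(url).read(), "html.parser")
        # ...then locate the price elements by their id or class, just like above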

My plan moving forward is to collect, each day, the bulk data for all the United flights departing PHL and arriving at PEK, so I have more data to work with. This couldn't be done before, since recording that amount of data manually every day would take far too long. Next time, with the larger dataset, I am hoping there will be some unexpected discoveries.

 

Sources:

Yau, Nathan. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Indianapolis, Wiley Publishing, 2011.

Yek, Justin. “How to Scrape Websites with Python and BeautifulSoup.” FreeCodeCamp, Medium, 10 June 2017, medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe. Accessed 21 Oct. 2018.

2 thoughts on “Data Collection Continued – Alina”

  1. baitingz

    Hi Alina, very interesting introduction on how to scrape data from a website. I think you did a great job explaining your code and logic! Since I am not a Comp Sci person, I have a general question: do you need to send multiple requests to the website to find the data (as you are frequently changing your date)? Also, if you know anything about APIs, could you please give us a brief introduction? Did you use any APIs in your code? Furthermore, other than collecting data, have you decided on your analytic model(s)? Please share that with me, as I am super excited to know which comparison/evaluation model(s) you choose!!

  2. Susan C Waterhouse

    It is interesting to think about why Expedia might stop showing a flight (will it start to show the flight again after a while, or never show it again)? I wonder if you will be able to tell how often the available flights change when you do larger data collections from these websites. How will that play into the question of whether the price changes often, etc.?

