This past week marked the end of the data collection period of my project. In my last blog post, I figured out how to scrape data from websites with simple structures. Since then, I have been experimenting with pulling data from the Expedia website, which is far more complex, and I ran into some difficulties. I decided to start with a piece of data that should be easy to pull, to see whether the code would work for this site at all, so I picked the flight date shown on the page. Its element had the attribute class="title-date-rtv", and I put this value into the code.
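For readers who want to see what looking up an element by its class can look like, here is a minimal sketch using only Python's standard library. It is not the exact code from this project: the names ClassTextFinder and find_text_by_class are hypothetical, and the sample HTML is made up for illustration rather than copied from Expedia.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextFinder(HTMLParser):
    """Collect the text of the first element that carries all wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes.split())
        self.depth = 0        # > 0 while we are inside the matching element
        self.found = False    # True once a matching element has been seen
        self.done = False     # True once that element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # nested tag inside the match
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_text_by_class(html, wanted_classes):
    """Return the text of the first matching element, or None if none matches."""
    finder = ClassTextFinder(wanted_classes)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks).strip() if finder.found else None


# Made-up sample markup (an assumption, not Expedia's real page).
sample = ('<div><span class="title-date-rtv">Sat, Nov 3</span>'
          '<span class="full-bold no-wrap">$123</span></div>')
print(find_text_by_class(sample, "title-date-rtv"))
print(find_text_by_class(sample, "not-on-the-page"))
```

The second call prints None because no element has that class. A lookup can come back empty for the same reason on a real page if the HTML the server returns does not contain the element you see in the browser.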
I don’t recall what exactly the first few runs returned, but I remember that I didn’t get the date that was printed on the page. When I gave it another try, I got a different error message – HTTP Error 429: Too Many Requests.
Even though the name was pretty self-explanatory, I wanted to know more about what it meant and whether I could fix my code to avoid this error, so I looked it up online. I learned that this error usually occurs because a website limits the number of requests it will accept from each IP address during a certain time period. This is a way for websites to protect themselves from malicious attacks. In some cases, it is also possible that you haven’t really made a lot of requests in a short amount of time, and the server simply returned this error anyway; in that case, no conclusion can be drawn from the message. On Stack Overflow, the advice was that you should not try to get around the limit, because spamming a website would be considered unethical and possibly even illegal. The suggested option was that if your code runs in a loop, you should “sleep your process”, which means using a function to pause the process for a certain length of time after each request to avoid overwhelming the server.
However, if you were like me and your code wasn’t in a loop, there really wasn’t much you could do to get a result at that moment, apart from maybe using a VPN with a different IP address. My only option was to wait until it was okay for me to make another request. The problem was that I didn’t know how long I was supposed to wait. When I tried again 10 minutes later, I got the same error message. I consulted T. Tom, my mentor, to see if he had any advice. He said I should try again the next day, so that was what I did. This time, I finally got the date printed as I expected.
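The “sleep your process” advice can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this project: the function name fetch_with_backoff, the number of attempts, and the 60-second starting delay are all hypothetical choices. It uses urllib from the standard library, pauses after each 429 response, and uses the server’s Retry-After header when that header is present and given in whole seconds.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_attempts=4, base_delay=60.0):
    """Fetch a URL, pausing and retrying when the server answers 429.

    `opener` and `sleep` are parameters only so the function can be tried
    out without touching the network; the defaults are the real functions.
    """
    for attempt in range(max_attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on any other error, or when we are out of attempts.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)            # the server told us how long
            else:
                delay = base_delay * 2 ** attempt   # otherwise wait longer each time
            sleep(delay)
```

Calling fetch_with_backoff with a page address would return the raw bytes of the page, or raise the last error if every attempt was refused. Waiting like this does not guarantee success: as described above, even a 10-minute wait was not enough in my case.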
Now that I knew my method of collection was viable, I decided to see if I could finally pull the price data. I found that the price section had the attribute class="full-bold no-wrap".
I put this in my code.
Again, I got the Too Many Requests error. When I tried again, the result was “None”, which meant that the program didn’t find any element with the given class, and I didn’t understand why. So I ran it once more, and it gave me the same Too Many Requests error again. The main issue was that I had no efficient way to test and modify my code, because of how long I had to wait between attempts.
Therefore, since I am at the end of the data collection period, I have decided to shift my focus to analyzing the data for the moment, although I will continue to explore pulling data from the Expedia site with my code on the side.
This is the data that I have to work with now. It is by no means a big data set, but it should be enough for my purposes.
I have also begun investigating ways to analyze the data. I have been reading Visualize This by Nathan Yau, and I have been following a Codecademy course on data analysis. More on that in my next blog!
“429 Too Many Requests.” Http Statuses, httpstatuses.com/429. Accessed 4 Nov. 2018.
“How to avoid HTTP error 429 (Too Many Requests) python.” Stack Overflow, 1 Apr. 2014, stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python. Accessed 4 Nov. 2018.
“Introduction to Data Analysis.” Codecademy, www.codecademy.com/programs/d4ca904f105f85fdb149aaa77d3c011b/items/c36432f958dcd000126c1bc58240a619. Accessed 4 Nov. 2018.
Yau, Nathan. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Indianapolis, Wiley, 2011.