A Failed Attempt – Alina

This past week marked the end of the data collection period of my project. After figuring out in my last blog post how to scrape data from websites with simple structures, I had been experimenting with pulling data from the Expedia website, which is far more complex. As I tried this, I ran into some difficulties. I decided to start with data that should be easy to pull, to see whether the code would work on this site at all, so I picked the date of the flight shown on the page. It had the attribute class="title-date-rtv", and I put this value into the code.

Screen Shot 2018-11-03 at 5.39.05 PM
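For readers who can't see the screenshot, here is a minimal sketch of this kind of class-based lookup. It is not necessarily the code shown above: it uses only Python's standard html.parser module, and the names ClassTextFinder, find_text_by_class, and SAMPLE_HTML (including the date inside it) are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextFinder(HTMLParser):
    """Collect the text of the first element carrying all the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes.split())
        self.depth = 0        # greater than 0 while inside the matching element
        self.found = False    # True once a matching element has been seen
        self.done = False     # True once that element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # a nested tag inside the match
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted and self.wanted <= classes:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_text_by_class(html, wanted_classes):
    """Return the text of the first matching element, or None if there is none."""
    finder = ClassTextFinder(wanted_classes)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks).strip() if finder.found else None


# Hypothetical stand-in for the real page; the date is invented.
SAMPLE_HTML = '<div><span class="title-date-rtv">Sat, Dec 22</span></div>'
```

Calling find_text_by_class(SAMPLE_HTML, "title-date-rtv") returns the text inside the span, and a class that does not appear in the HTML gives None.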

I don’t recall what exactly the first few runs returned, but I remember that I didn’t get the date that was printed on the page. When I gave it another try, I got a different error message – HTTP Error 429: Too Many Requests.

Screen Shot 2018-10-31 at 5.14.23 PM

Even though the name was pretty self-explanatory, I wanted to know more about what it meant and whether there was a way to fix my code to avoid this error, so I looked it up online. What I learned was that this error usually occurs because a website limits the number of requests it will accept from each IP address during a certain time period. This is a way for websites to protect themselves from malicious attacks. In some cases, it is also possible that you haven’t really made a lot of requests in a short amount of time, but the server returned that error anyway; in that case, the message alone doesn’t tell you much.

On Stack Overflow, the advice was that you should not try to get around the limit, because spamming a website would be considered unethical and possibly even illegal behavior. The option given was that if your code runs in a loop, you should “sleep your process”, which means using a function to pause your process for a certain length of time after each run so that it doesn’t overwhelm the server. However, if you were like me and your code wasn’t in a loop, there really wasn’t much you could do to get a result at that moment, apart from maybe using a VPN with a different IP address.

My only option was to wait until it was okay for me to make another request. The problem was that I didn’t know how long I was supposed to wait. When I tried again 10 minutes later, it returned the same error message. I consulted T. Tom, my mentor, to see if he had any advice. He said I should try again the next day, so that was what I did. This time, I finally got the date printed as I expected.
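The “sleep your process” advice described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not code from the Stack Overflow thread: the function name fetch_politely, its parameters, and the default delays are hypothetical. It pauses with time.sleep and retries when the server answers 429, and it uses the server’s Retry-After header when that header holds a plain number of seconds.

```python
import time
import urllib.error
import urllib.request


def fetch_politely(url, max_attempts=4, base_delay=60.0,
                   opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, pausing and retrying when the server answers 429.

    opener and sleep are parameters only so the function can be tried
    out without touching the network (an assumption of this sketch).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on any other error, or once the attempts are used up.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.strip().isdigit():
                # The server said how many seconds to wait.
                delay = float(retry_after)
            else:
                # Otherwise double the pause on each failed attempt.
                delay = base_delay * (2 ** attempt)
            sleep(delay)
```

The pause only spaces requests out; it does not get around the limit. If the server wants a wait of hours, as seemed to be the case here, trying again later is still the only real answer.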

Screen Shot 2018-10-31 at 10.38.06 PM

Now that I knew my method of collection was viable, I decided to see if I could finally pull the price data. I found that the price section had the attribute class="full-bold no-wrap".

Screen Shot 2018-11-04 at 4.38.11 PM

I put this in my code.

Screen Shot 2018-11-04 at 4.40.33 PM

Again, I got the Too Many Requests error message. When I tried again, the result returned was “None”, which meant that the program didn’t find any element with the given class, and I didn’t understand why. So I ran it again, and this time it gave me the same Too Many Requests error. The main issue was that I had no way to test and modify my code efficiently because of how long I had to wait between attempts.
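One possible way to ease the waiting problem, which is not what I did at the time, is to save the page the first time a request succeeds and then test the parsing code against the saved copy. The sketch below assumes this approach; load_page, _download, and the cache_path argument are hypothetical names.

```python
from pathlib import Path
import urllib.request


def _download(url):
    """Fetch the page once over the network and return it as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def load_page(url, cache_path, fetch=_download):
    """Return the page HTML, reading a saved copy when one exists.

    Only the first call sends a request; later calls read the file,
    so the parsing code can be rerun without contacting the server.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Opening the saved file also shows whether the element is in the returned HTML at all. A “None” result can simply mean it is not there, for example if the page fills in prices with JavaScript after loading, though I can’t say whether that was the cause in my case.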

Therefore, since I am at the end of the data collection period, I have decided to shift my focus to the analysis of the data for the moment, although I am going to continue exploring on the side the possibility of pulling data off the Expedia site with my code.

This is the data that I have to work with now. It is by no means a big data set, but it should be enough for my purposes.

Screen Shot 2018-11-04 at 5.15.38 PM

Screen Shot 2018-11-04 at 5.14.47 PM

I have also begun investigating ways to analyze the data. I have been reading Visualize This by Nathan Yau, and I have been following a Codecademy course on data analysis. More on that in my next blog!

 

Sources:

“429 Too Many Requests.” Http Statuses, httpstatuses.com/429. Accessed 4 Nov. 2018.

“How to avoid HTTP error 429 (Too Many Requests) python.” Stack Overflow, 1 Apr. 2014, stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python. Accessed 4 Nov. 2018.

“Introduction to Data Analysis.” Codecademy, http://www.codecademy.com/programs/d4ca904f105f85fdb149aaa77d3c011b/items/c36432f958dcd000126c1bc58240a619. Accessed 4 Nov. 2018.

Yau, Nathan. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Indianapolis, Wiley, 2011.

2 thoughts on “A Failed Attempt – Alina”

  1. Dhillon

    Hi Alina, this is really cool. Coding is the future of our world. It was really fascinating learning about how you encountered a problem, and rather than turning away you looked into it by learning what it meant and then testing various methods to try and solve the problem. Keep up the great work! I am excited to see the finished product.

  2. baitingz

    Hi Alina, congrats on finishing pulling down all your data! Since I have been following your blog for quite a while, I have several questions. You said in this blog that you sent too many requests to the website; did you check on something called an API? I know that some big websites like Google provide APIs for pulling data. They may charge you a little money if you need a lot of requests in one day. I worked on a math model that required a lot of data with Kevin Wang, so I think Kevin knows a lot about this; maybe you can also try to reach out to him? And one more thing: I see that you are starting the analysis process, and I have a small suggestion for you. Maybe you should start with the basic definitions in statistics, like standard deviation, normal distribution, and regression. Then I would suggest that you “pre-treat” your data before running any models. In general, I think all of your variables are consecutive and automatically quantified, so I hope you will find the process not very stressful!
