Author Archives: Alina Zhao

A Failed Attempt – Alina

This past week marks the end of the data collection period of my project. After figuring out in my last blog post how to scrape data from websites with simple structures, I had been experimenting with pulling data from the Expedia website, which is far more complex. However, as I tried to do this, I ran into some difficulties. I decided to start with data that should be easy to pull, to see whether the code would work for this site at all. I picked the flight date shown on the page, which had the attribute class="title-date-rtv", and put this value into the code.

Screen Shot 2018-11-03 at 5.39.05 PM
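
The code in the screenshot isn’t reproduced in text, so here is a minimal sketch of this kind of lookup for anyone who wants to try it. It assumes Python and uses only the standard library’s html.parser; the helper names (ClassTextFinder, find_text_by_class) and the sample HTML are made up for illustration and are not the code from the screenshot.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextFinder(HTMLParser):
    """Collect the text inside the first element carrying the wanted classes."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = set(wanted_class.split())
        self.depth = 0        # greater than 0 while inside the matching element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # a nested element inside the match
        elif not self.found:
            classes = set((dict(attrs).get("class") or "").split())
            if self.wanted <= classes:
                self.found = True
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_text_by_class(html, wanted_class):
    """Return the matching element's text, or None if nothing matches."""
    parser = ClassTextFinder(wanted_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else None


# Made-up sample markup, not the real Expedia page.
sample = ('<div><span class="title-date-rtv">Sat, Nov 3</span>'
          '<span class="full-bold no-wrap">$512</span></div>')
print(find_text_by_class(sample, "title-date-rtv"))
```

Returning None when nothing matches makes a missing element easy to tell apart from an element that is simply empty.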

I don’t recall what exactly the first few runs returned, but I remember that I didn’t get the date that was printed on the page. When I gave it another try, I got a different error message – HTTP Error 429: Too Many Requests.

Screen Shot 2018-10-31 at 5.14.23 PM

Even though the name was pretty self-explanatory, I wanted to know more about what it meant and whether there was a way to fix my code to avoid this error, so I looked it up online. What I learned was that this error usually occurs because websites limit the number of requests that can come from each IP address during a certain time period. This is a way for websites to protect themselves from malicious attacks. In some cases, it is also possible that you haven’t really made a lot of requests in a short amount of time, but that is simply the error the server returned, and in that case the message alone doesn’t tell you much.

On Stack Overflow, the advice was that you should not try to get around the limit, because spamming a website would be considered unethical and possibly even illegal. The suggested option was that if your code runs in a loop, you should “sleep your process,” which means using a function to pause the process for a certain length of time after each run to avoid overwhelming the server. However, if you were like me and your code wasn’t in a loop, there really wasn’t much you could do to get a result at that moment, apart from maybe using a VPN with a different IP address.

My only option was to wait until it was okay for me to make another request. The problem was that I didn’t know how long I was supposed to wait. When I tried again 10 minutes later, it returned the same error message. I consulted T. Tom, my mentor, to see if he had any advice. He said I should try again the next day, so that was what I did. This time, I finally got the date printed as I expected.

Screen Shot 2018-10-31 at 10.38.06 PM
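
The “sleep your process” advice mentioned above can be sketched roughly like this. It is only an illustration under assumptions: it uses Python’s standard urllib, the function name fetch_with_backoff and the default delays are made up, and it is not the code I actually ran. If the server sends a Retry-After header with the 429 response, the sketch waits that many seconds; otherwise it doubles the wait on each attempt.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_tries=4, base_delay=60):
    """Fetch url, pausing and retrying when the server answers 429."""
    for attempt in range(max_tries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on any other error, or once the tries are used up.
            if err.code != 429 or attempt == max_tries - 1:
                raise
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)            # the server said how long to wait
            else:
                delay = base_delay * 2 ** attempt   # otherwise wait longer each time
            sleep(delay)
```

The opener and sleep parameters exist only so the function can be tried out with stand-ins instead of a real website and real waiting.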

Now that I knew my method of collection was viable, I decided to see if I could finally pull the price data. I found the price element, which had the attribute class="full-bold no-wrap".

Screen Shot 2018-11-04 at 4.38.11 PM

I put this in my code.

Screen Shot 2018-11-04 at 4.40.33 PM

Again, I got the Too Many Requests error message. When I tried again, the result returned was “None,” which meant that the program didn’t find any element with the given class, and I didn’t understand why. So I ran it again, and this time it gave me the same Too Many Requests error. The main issue was that I had no way to test and modify my code efficiently because of how long I had to wait between attempts.
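
One way around the waiting problem, sketched here as an idea rather than something I have tested on Expedia, is to download the page once, save it to a file, and run every later experiment against the saved copy. The sketch assumes Python’s standard library; the function name load_page and the file name in the usage note are made up for illustration.

```python
from pathlib import Path
import urllib.request


def load_page(url, cache_path, fetch=None):
    """Return the page's HTML, downloading only if no saved copy exists."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Reuse the saved copy so repeated test runs make no new requests.
        return cache_file.read_text(encoding="utf-8")
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target) as response:
                return response.read().decode("utf-8", errors="replace")
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Calling load_page(url, "expedia_page.html") a second time reads the file instead of contacting the server. A saved copy also makes it possible to search the raw HTML for a class name and check whether the element is in the downloaded page at all.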

Therefore, since I am at the end of the data collection period, I have decided to shift my focus to analyzing the data for the moment, although I will keep exploring on the side whether I can pull data off the Expedia site with my code.

This is the data that I have to work with now. It is by no means a big data set, but it should be enough for my purposes.

Screen Shot 2018-11-04 at 5.15.38 PM

Screen Shot 2018-11-04 at 5.14.47 PM

I have also begun investigating ways to analyze the data. I have been reading Visualize This by Nathan Yau, and I have been following a Codecademy course on data analysis. More on that in my next blog post!

 

Sources:

“429 Too Many Requests.” Http Statuses, httpstatuses.com/429. Accessed 4 Nov. 2018.

“How to avoid HTTP error 429 (Too Many Requests) python.” Stack Overflow, 1 Apr. 2014, stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python. Accessed 4 Nov. 2018.

“Introduction to Data Analysis.” Codecademy, http://www.codecademy.com/programs/d4ca904f105f85fdb149aaa77d3c011b/items/c36432f958dcd000126c1bc58240a619. Accessed 4 Nov. 2018.

Yau, Nathan. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Indianapolis, Wiley, 2011.

Data Collection Continued – Alina

Updated Table

If you remember from my last blog post, my focus in this project has recently been on figuring out a way to scrape data off travel websites using code instead of doing it manually, since manual collection is a tedious job. Of course, while working on the code, I have also kept up the manual collection, since data collection is the objective of this month’s work in my project. So here’s an updated version of my data table:

The Website – Alina


Hey guys, welcome back! You might remember that in my last blog I mentioned a study-hall sign-in website, which is part of the main focus of my project, and I promised to come back with more details on that, so here I am. Before I dive in though, I just want to provide some quick updates on my quest for answers regarding the manipulated plane ticket prices.

The Pros and Cons of Big Data – Alina


Intro

This past summer, I flew back and forth between China and the US a lot, which meant I had to book plane tickets a number of times. During this process, I used a Chinese travel website called Ctrip, which my family has always liked and trusted. It is also the largest online travel agency in China. However, this time my experience was not so pleasant. The price of the tickets I was looking at kept going up every time I returned from looking up similar tickets on other websites, which could be interpreted as normal, since prices might rise as the travel date approaches. The part that took me by surprise, though, was that when I logged in with a different account and looked at the same tickets on the same date, the prices differed. I don’t recall the exact price gap, but I remember that it was enough for me to be upset and intrigued at the same time. After a brief search online, I found that there were already news reports accusing this company of manipulating its customers through the use of “big data.” This discovery deeply interested me. I could not help but wonder about questions like “How exactly are they using the data they collect to achieve their goal? How are other Internet companies like Netflix and Google using their data? What are the ethical implications of this? What impact does this have on our society as a whole?”