  • mattostanik

How to use Google Cloud to analyze a lot of data

Updated: Feb 15, 2019



Recently I was working on a data analysis project with a very large volume of data. I had written a Python application that analyzed the performance of a set of 2,500 stock tickers over a 60-day period. My application included a series of nested loops, and it took about 10 minutes to run for each ticker.


This was way too long for me (at 10 minutes each, it would have taken 17 days to finish the analysis). I rewrote the loops as a single PostgreSQL query instead. This had a dramatic impact, reducing the time from 10 minutes per ticker to only 45 seconds per ticker.
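
To illustrate the kind of change involved, here is a minimal sketch. It is not the actual analysis from this project: the prices table, its columns, and the "best gain" calculation are all hypothetical, and Python's built-in sqlite3 module stands in for PostgreSQL so the example runs without a server. The first function does the comparison with nested loops in Python; the second pushes the same work into one query.

```python
import sqlite3

# Hypothetical sample data; the real project's schema is not shown in this post.
ROWS = [
    ("AAA", 1, 10.0), ("AAA", 2, 11.0), ("AAA", 3, 12.0),
    ("BBB", 1, 20.0), ("BBB", 2, 19.0), ("BBB", 3, 18.0),
]


def make_db():
    """Create an in-memory database with a small prices table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (ticker TEXT, day INTEGER, close REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", ROWS)
    return conn


def best_gain_loops(conn, ticker):
    """Nested loops: compare every buy day with every later sell day."""
    rows = conn.execute(
        "SELECT day, close FROM prices WHERE ticker = ? ORDER BY day", (ticker,)
    ).fetchall()
    best = None
    for buy_day, buy_close in rows:
        for sell_day, sell_close in rows:
            if sell_day > buy_day:
                gain = sell_close - buy_close
                if best is None or gain > best:
                    best = gain
    return best


def best_gain_sql(conn, ticker):
    """The same comparison as a single self-join run inside the database."""
    row = conn.execute(
        """
        SELECT MAX(s.close - b.close)
        FROM prices AS b
        JOIN prices AS s ON s.ticker = b.ticker AND s.day > b.day
        WHERE b.ticker = ?
        """,
        (ticker,),
    ).fetchone()
    return row[0]


conn = make_db()
for t in ("AAA", "BBB"):
    # Both approaches should agree on every ticker.
    assert best_gain_loops(conn, t) == best_gain_sql(conn, t)
```

On six rows the two versions take the same time. The difference shows up on large tables, where the database can use indexes and avoids sending every row back to Python.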


Initially I was excited about the improvement, but then I needed to expand my analysis to cover an entire year instead of only 2 months. Running the query for an entire year of data took about 6 minutes per ticker. Again this was much more time than I had patience for!


My solution was to break my data set into multiple segments that could each be analyzed on separate servers running on Google Cloud Platform. Instead of having my local computer run the analysis for 10 days, I quickly and inexpensively started up 10 database servers on Google Cloud Platform that finished the analysis in 1 day.
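
The time estimates in this post come from simple arithmetic, sketched below. The helper name total_days is made up for illustration, and it assumes the work splits perfectly evenly across servers with no overhead, which is only an approximation.

```python
TICKERS = 2500  # number of stock tickers in the analysis


def total_days(minutes_per_ticker, servers=1):
    """Days needed to process every ticker, with the work split evenly across servers."""
    total_minutes = TICKERS * minutes_per_ticker / servers
    return total_minutes / (60 * 24)


print(round(total_days(10), 1))             # nested loops, 60 days of data: 17.4
print(round(total_days(0.75), 1))           # 45-second query, 60 days of data: 1.3
print(round(total_days(6), 1))              # full-year query on one machine: 10.4
print(round(total_days(6, servers=10), 1))  # full-year query on 10 servers: 1.0
```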


Getting started with Google Cloud


Google Cloud Platform is an umbrella name that covers many different services from Google. To start up a server that you can run Python or other applications on, Compute Engine is the place to go. Once you have created a Google Cloud Platform account, you can navigate to Compute Engine to create your first instance.


When creating your first instance, pick the zone/region that is closest to you. I typically use the default/standard instance sizes unless I know I have a specific need for something different.


After your first instance is running, you also need to go to Networking > VPC Network > External IP addresses to set up an external address for the instance. Google Cloud allows for one external IP address per zone. This works fine if you use your initial instance as both the primary access point and the control point for running the Python scripts for the data analysis.


Adding your database servers


Then go to SQL in the Google Cloud console and choose to create a new instance. This gives you a managed database server. I personally have more experience with PostgreSQL, but other options are available as well.


Once your database is created, you need to adjust the network settings to allow access to it from two locations:

  1. Your local computer (by looking up your IP address from your office or work location)

  2. The instance you created in the previous step (using the external IP address that was assigned for it)
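
The authorized networks setting accepts addresses in CIDR notation, where a single machine is written as a /32 block. As a small sketch, Python's standard ipaddress module can validate an address and produce that form. The helper name as_cidr and the two addresses are placeholders from the range reserved for documentation, not real values.

```python
import ipaddress


def as_cidr(address):
    """Validate one IPv4 address and return it as a /32 CIDR block."""
    return str(ipaddress.ip_network(address))


# Placeholder addresses; substitute your own.
office_ip = "203.0.113.10"    # your local computer
instance_ip = "203.0.113.25"  # external IP of the Compute Engine instance

for ip in (office_ip, instance_ip):
    print(as_cidr(ip))
```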

You will then want to connect directly to the databases and configure your tables as needed for the project. I prefer to use DataGrip for this work.


Copy your files and start the analysis


In my example above, I took my data set and divided it up into 10 segments for the 10 database servers I created on Google Cloud to work on. I also created 10 copies of my Python analysis script with each copy connecting to a different database. This was a quick project for me, but in larger projects, you could revise a single copy of the Python application to manage all of the connections internally.
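
For the single-application approach, a rough sketch might look like the following. It is not the code from this project: the ticker names, host names, and the analyze_segment placeholder are all invented for illustration, and the real version would open a database connection and run the query where the placeholder simply counts tickers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(items, n):
    """Split items into n contiguous segments whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    segments, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        segments.append(items[start:end])
        start = end
    return segments


def analyze_segment(host, tickers):
    """Placeholder: the real version would connect to the database at
    `host` and run the analysis query for each ticker in `tickers`."""
    return host, len(tickers)


# Hypothetical ticker symbols and database host names.
tickers = [f"T{i:04d}" for i in range(2500)]
hosts = [f"db-{i}.example.internal" for i in range(1, 11)]

segments = split_into_segments(tickers, len(hosts))

# One worker thread per database server; each thread handles one segment.
with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
    results = list(pool.map(analyze_segment, hosts, segments))

for host, count in results:
    print(host, count)
```

Threads are a reasonable fit here because each worker would spend most of its time waiting on its database rather than computing locally.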


Then go to the terminal or command line on your local computer. Log in to Google Cloud by typing:


gcloud auth login


Then you can copy your Python files to the remote instance using this command:


gcloud compute scp example instance-1:~/


Here example is the name of my file. You can list multiple file names before the destination to copy them all at the same time.
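
If you have many script copies to transfer, you could build the command in Python instead of typing it out. This is only a sketch: the helper scp_command and the file names are made up, and it just assembles the argument list without running anything.

```python
import shlex


def scp_command(files, instance="instance-1", dest="~/"):
    """Build the argument list: gcloud compute scp FILE... INSTANCE:DEST."""
    return ["gcloud", "compute", "scp", *files, f"{instance}:{dest}"]


# Hypothetical file names for the ten script copies.
files = [f"analysis_{i}.py" for i in range(1, 11)]
cmd = scp_command(files)

# Show the command as it could be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run the copy, pass the list to subprocess.run(cmd, check=True) on a machine where gcloud is installed and logged in.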


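Then you can SSH to your instance. Running gcloud compute ssh instance-1 is one option. If you have set up host aliases with gcloud compute config-ssh, you can also use plain ssh, similar to this: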


ssh instance-1.us-yourzone-yourname-xyz-123456


Once you are connected to the terminal on the remote instance, you can execute the Python script(s) that you previously transferred. Don’t forget to make sure you have the correct version of Python installed on the virtual instance and that you have also installed any packages that are required by your analysis script.
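
One way to check the environment before a long run is a short preflight script like the sketch below. The minimum version and the package list are assumptions for illustration (psycopg2 is a common PostgreSQL driver, but this post does not say which one the project used), so replace them with whatever your script really imports.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 6)          # assumed minimum; adjust for your script
REQUIRED_PACKAGES = ["psycopg2"]  # assumed example; list your real imports


def missing_packages(names):
    """Return the top-level package names that cannot be found here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(required_python=REQUIRED_PYTHON, packages=REQUIRED_PACKAGES):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if sys.version_info < required_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        needed = ".".join(str(part) for part in required_python)
        problems.append(f"Python {needed}+ required, found {found}")
    for name in missing_packages(packages):
        problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print(issue)
    if not issues:
        print("environment looks ready")
```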


There are many additional things you can do with Google Cloud, and this is just a simple starting point. If you’d like to learn more, feel free to contact me.

