Serverless ETL for Sirocco on Google Cloud
I published the architecture of a serverless ETL solution for Sirocco on the GCP Big Data blog. The solution scales from a few news articles in a cloud bucket to millions of news posts in a database, taking advantage of Cloud Dataflow’s autoscaling features. With that post, you should now have all the components for building a news monitoring or opinion tracking solution. I know this because I am using exactly the same setup for an actual news monitoring solution. More about that in a future post.
Here is what I suggest you do:
Read about Plutchik’s framework for emotion analysis to understand the theory behind this solution.
Read about the ETL solution.
Go to the GitHub repo and follow the instructions in the README. Set up your own processing pipeline and run a test that crunches the few news articles I uploaded to the test folder.