Site Reliability Engineer (SRE)

Boston, Massachusetts, United States | Full-time

Do you enjoy challenges at work that not only advance your career development but also have an immediate, measurable impact? Here at Quantopian, we have an opportunity for a skilled and passionate Site Reliability Engineer who’s looking to tackle the kinds of projects that arise in a quickly growing startup environment. You’ll have the opportunity to shape high-level strategy, such as making orchestration architecture decisions, and to work in the trenches, honing your coding skills by building infrastructure automation. You will work closely with developers to implement best practices, helping them understand the tradeoffs between technology choices and encouraging standardization on reusable solutions.

At Quantopian, we value the role our SRE team plays in maintaining a robust and scalable infrastructure. We recognize that establishing operational processes is important because we don’t want to repeat the same mistake. And we appreciate that occasionally pushing back can be instrumental in reducing complexity. We get that “DevOps” is a cultural movement, not a skill set, and we embrace the ideals of respect and interconnectedness required for all of our engineers to achieve great things together.

Who is Quantopian? We're a small team of smart, dedicated engineers and technophiles. We're building the world's first crowd-sourced hedge fund, and we're growing our team to support our rapidly expanding user base and add the features our users are clamoring for. We code in Python, Ruby, Bash, and anything else we need to get the job done. Our backend runs on AWS and our frontend is on Heroku. In AWS, we deploy on top of Ubuntu. We store our data in MongoDB, PostgreSQL, Cassandra, Redis, and Memcached. We ship early and often. We have lives outside of work. We like each other.

In this job, you should expect to:

Deploy and manage our full stack of public cloud-based services

  • You’ll build tools in Python and use tooling such as Ansible, CloudFormation, and Consul
  • You’ll draw on your years of experience administering web services, managing databases, and deploying on Linux
  • You’ll continue building on your practical knowledge of AWS and other cloud-based services

Build up our metrics-gathering and alert-generating services

  • You’ll use Graphite, StatsD, and CloudWatch to ensure platform and application metrics are easily deployable, scalable, and usable

Hunt for Single Points of Failure (SPOF) in our infrastructure and deployment methodologies

  • You’ll bring the battle scars that will help you identify potential SPOFs before they’re deployed to production

Convert manual processes into automated tasks

  • You’ll think: why are these devs building an AMI by hand? I’m sure there are ways we could automate this, and why not make it part of our CI pipeline?

Write code to glue together various services

  • Our architecture is mostly service-oriented, which means there’s plenty of orchestration involved. You’ll need to understand a complex dependency graph and figure out how design decisions would affect platform manageability.

Become part of our 24/7 on-call rotation

  • We use PagerDuty to ensure that important alerts wake you up only when you’re on call. When you wake up, you’ll be able to get up to speed quickly and determine who can help you when you can’t fix it yourself.

 

You should not expect to:

  • Have financial/asset management knowledge
  • Know everything about all the things (but you should know a lot about at least one thing)
  • Be an expert programmer or a unicorn