Site Reliability Engineer
Datadog
(New York, New York)
Datadog is the leading service for cloud-scale monitoring. It is used by IT, operations, and development teams who build and operate applications that run on dynamic or high-scale infrastructure. Because Datadog collects metrics and events from 100+ different technologies and services out of the box, including other monitoring tools, you can monitor your entire stack in one place, without any gaps.
- Keep our service reliable, available and fast as a member of the operations team.
- Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code.
- Design, build and maintain the infrastructure we need to support orders of magnitude more customers.
Who you must be
- You have a BS/MS/PhD in a scientific field
- You have a track record as an operations engineer at a large site
- You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
- You manage infrastructure as code, because automation lets you focus on the more difficult and rewarding problems
- You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, Postgres, Kafka, Elasticsearch, Redis
Bonus Points
- You have submitted bug fixes to the aforementioned projects
- You are fully fluent in Python, Ruby and Go