Site Reliability Engineer

Datadog

(New York, New York)
Full Time
Job Posting Details
About Datadog

Datadog is the leading service for cloud-scale monitoring. It is used by IT, operations, and development teams who build and operate applications that run on dynamic or high-scale infrastructure. Because Datadog collects metrics and events from 100+ different technologies and services out of the box, including other monitoring tools, you can monitor your entire stack in one place, without any gaps.

Responsibilities
  • Keep our service reliable, available and fast as a member of the operations team.
  • Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code.
  • Design, build and maintain the infrastructure we need to support orders of magnitude more customers.
Ideal Candidate

Who you must be

  • You have a BS/MS/PhD in a scientific field
  • You have a track record as an engineer in the operations of a large site
  • You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
  • You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems
  • You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, Postgres, Kafka, Elasticsearch, Redis

Bonus Points

  • You have submitted bug fixes to the aforementioned projects
  • You are fully fluent in Python, Ruby and Go

Skills Desired
  • Infrastructure
  • Operations
  • Python
  • Ruby
  • Apache Cassandra
  • Apache Kafka
  • Automation
  • Distributed Computing
  • Elasticsearch
  • Go
  • Kernel
  • PostgreSQL Programming
  • Redis
  • Apache ZooKeeper
