Datagon AI GmbH
About Us:
We are a dynamic industrial AI startup based in Munich with a young and motivated team, gaining significant traction in the industry. Our focus is on innovation and rapid growth, providing an exciting environment for professionals to thrive. To learn more about our company’s mission and values, visit our About Us page.
Position Overview:
We are seeking a highly motivated and versatile Site Reliability Engineer (SRE) to join our team. The role focuses primarily on ensuring the reliability, scalability, and performance of our software systems. Alongside your SRE and DevOps tasks, you will also develop new features and enhance existing functionality across the software stack. If you lack experience in some of these areas, the ability to learn quickly is essential.
Ideal start date: November 2024
Responsibilities:
- Ensure the availability, performance, and scalability of our systems.
- Develop, maintain, and improve CI/CD pipelines and server infrastructure.
- Design, implement, and manage monitoring and alerting systems.
- Automate routine tasks, streamline infrastructure, and minimize downtime.
- Troubleshoot incidents, conduct root cause analysis, and resolve issues promptly.
- Develop new features and contribute to feature enhancements across the software stack.
- Develop and administer infrastructure as code (IaC) using tools such as Terraform or AWS CloudFormation.
- Contribute to disaster recovery planning, backup strategies, and the design of high-availability systems.
What We Offer:
- High responsibility and the opportunity to make a significant impact on the reliability, efficiency, and development of our software.
- An innovative and energetic team.
- Significant traction in the industry.
- Fast-track professional development with potential leadership opportunities.
- A hybrid work policy with a focus on our office in Munich. We also offer a “soft landing” for applicants not currently living here by providing an apartment for up to six months.
- Visibility among the who's who of the industrial, startup, and software worlds.
- A competitive compensation package.
- E-Gym Wellpass membership, plus an office in an incubator with a gym, climbing hall, pool, and spa (all included) within a 5-minute walk.
Ideal Background:
- Strong experience in Site Reliability Engineering (SRE), DevOps, or Infrastructure Engineering.
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Ability to design scalable, reliable systems with a strong focus on uptime.
- Experience developing new features and contributing to full-stack software development.
- Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Comfortable working in fast-paced, semi-structured environments.
- Strong motivation to learn and adapt quickly.
- Experience in startups or dynamic, high-growth teams is a plus.
- Degree in Computer Science, Software Engineering, or a related field.
- Fluent in English (fluency in German is a plus).
- Highly communicative, versatile, and collaborative.
Our Tech Stack:
- Docker, GitHub Actions, AWS
- Kubernetes, Terraform, Prometheus, Grafana
- React (JS/Native), Node.js, TypeScript
- Jest, Playwright
- Ideally also PyTorch, scikit-learn, and MLOps
Interview Process:
- Initial introduction.
- Software challenge.
- Final interview.
We value diversity and encourage applications from all qualified candidates. If you are ready to take on a challenging and rewarding role, and you have a passion for learning and innovation, we would love to hear from you!
Submit your application, including your CV and any references or GitHub projects, to careers@datagon.ai.
To apply for this job, please visit datagon.ai.