# Kubernetes生产环境实战:我们遇到的10个坑和解决方案
> **从单机部署到K8s集群的血泪经验**
>
> 系列文章:第2篇(云原生技术系列共5篇)
—
## 引言:你是否也遇到过这些痛苦?
你是否也经历过这样的场景:
**场景1:服务挂了,手动重启**
– 凌晨3点收到告警:服务崩溃
– 匆忙起床,SSH登录服务器
– 手动重启服务
– 第二天早上复盘:为什么没有人自动重启?
**场景2:扩容手忙脚乱**
– 大促活动,流量激增10倍
– 临时手动添加服务器
– 配置负载均衡
– 手忙脚乱,还是超时…
**场景3:配置环境不一致**
– 开发环境正常,测试环境报错
– 测试环境正常,生产环境崩溃
– 每次排查都要对比环境差异
– 浪费大量时间…
**场景4:版本回退噩梦**
– 新版本上线,发现严重bug
– 紧急回滚,手动切换
– 数据库迁移失败
– 服务中断1小时…
**场景5:资源利用率低**
– 每台服务器只跑20%负载
– 但又不敢部署更多服务
– 害怕资源争抢
– 服务器成本居高不下…
我曾经也经历过这些痛苦。
直到2022年,我们决定全面拥抱Kubernetes(K8s)。
—
## 我的故事:第一次在生产环境使用K8s
**时间**:2022年1月
**地点**:北京一个创业团队
**背景**:公司业务快速增长,需要支撑100万+日活
**挑战**:
– 50+个微服务
– 每天处理10亿+请求
– 需要高可用、可扩展
– 团队只有5个运维
**传统方案的问题**:
**方案1:手动部署**
“`bash
# 每个服务手动部署到10台服务器
ssh server1 “systemctl restart service-a”
ssh server2 “systemctl restart service-a”
# … 重复50次
“`
**问题**:
– 部署一次需要2小时
– 容易出错(漏部署、配置错误)
– 无法快速回滚
**方案2:Ansible自动化**
“`yaml
# playbook.yml
– hosts: servers
tasks:
– name: Restart service
systemd:
name: “{{ service_name }}”
state: restarted
“`
**问题**:
– 仍然需要管理服务器
– 无法自动扩缩容
– 无法自愈
**我们的K8s之旅**
**第1个月:学习阶段**
– 团队成员参加K8s培训
– 在测试环境搭建K8s集群
– 迁移5个非核心服务
**第3个月:试点阶段**
– 迁移20个核心服务
– 遇到各种问题(下文详细讲)
– 逐步完善监控和告警
**第6个月:全面迁移**
– 所有50个服务都在K8s上运行
– 实现了自动化运维
– 运维团队效率提升10倍
**效果对比**:
| 指标 | 传统方案 | K8s方案 | 提升 |
|------|---------|---------|------|
| 部署时间 | 2小时 | 2分钟 | -98% |
| 回滚时间 | 30分钟 | 30秒 | -98% |
| 故障恢复 | 手动(1小时) | 自动(1分钟) | -98% |
| 资源利用率 | 20% | 70% | +250% |
| 运维人力 | 5人 | 2人 | -60% |
—
## 第一部分:K8s核心概念(小白也能懂)
### 我踩过的第一个坑:概念太多,理解困难
**我的错误理解**:
– “Pod就是容器?”
– “Service和Deployment有什么区别?”
– “Ingress又是干什么的?”
**真相**:
#### 1. Pod – K8s的最小部署单元
**比喻**:Pod就像是一个”豌豆荚”,里面可以有一个或多个”豌豆”(容器)
**真实案例**:
“`yaml
# 单容器Pod
apiVersion: v1
kind: Pod
metadata:
name: nginx-pod
spec:
containers:
– name: nginx
image: nginx:1.25
ports:
– containerPort: 80
“`
**多容器Pod(Sidecar模式)**:
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
# 主容器:应用
– name: webapp
image: myapp:1.0.0
ports:
– containerPort: 3000
# Sidecar容器:日志收集
– name: log-collector
image: fluentd:v1.14
volumeMounts:
– name: log-volume
mountPath: /var/log
# Sidecar容器:监控
– name: monitoring
image: prometheus-node-exporter:v1.3
“`
**为什么需要多容器Pod?**
**案例1:日志收集**
– 应用容器:产生日志
– Sidecar容器:实时收集日志到ELK
– 好处:应用无需修改代码
**案例2:监控代理**
– 应用容器:运行应用
– Sidecar容器:收集监控数据
– 好处:自动化监控,无需配置
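补充一个排查小技巧:多容器Pod里每个容器的日志和终端都要用 -c 指定容器名来访问。下面是基于上文 webapp-pod 示例的几条常用命令(假设Pod已经创建):
```bash
# 列出Pod里的所有容器
kubectl get pod webapp-pod -o jsonpath='{.spec.containers[*].name}'
# 查看指定容器的日志(-c 指定容器名)
kubectl logs webapp-pod -c log-collector
# 进入主容器排查问题
kubectl exec -it webapp-pod -c webapp -- sh
```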
—
#### 2. Deployment – 管理Pod的生命周期
**比喻**:Deployment就像是一个”经理”,管理着一群”工人”(Pod)
**真实案例**:
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: webapp-deployment
spec:
# 副本数
replicas: 3
# 选择器
selector:
matchLabels:
app: webapp
# 模板
template:
metadata:
labels:
app: webapp
spec:
containers:
– name: webapp
image: myapp:1.0.0
ports:
– containerPort: 3000
resources:
requests:
memory: “256Mi”
cpu: “250m”
limits:
memory: “512Mi”
cpu: “500m”
“`
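可以把上面的配置保存成文件后实际验证一下Deployment的自愈能力(文件名 webapp-deployment.yaml 只是示例):
```bash
# 应用配置,确认3个副本都在运行
kubectl apply -f webapp-deployment.yaml
kubectl get pods -l app=webapp
# 随便删掉一个Pod,Deployment会自动补齐到3个副本
kubectl delete pod $(kubectl get pods -l app=webapp -o jsonpath='{.items[0].metadata.name}')
kubectl get pods -l app=webapp -w
```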
**Deployment的魔法**:
**场景**:我需要更新应用版本
**传统做法**:
“`bash
# 手动更新每个Pod
kubectl delete pod webapp-pod-1
kubectl delete pod webapp-pod-2
kubectl delete pod webapp-pod-3
# 等待新Pod创建…
“`
**Deployment做法**:
“`bash
# 只需要更新镜像
kubectl set image deployment/webapp-deployment webapp=myapp:2.0.0
# Deployment自动:
# 1. 创建新Pod(版本2.0)
# 2. 等待新Pod就绪
# 3. 删除旧Pod(版本1.0)
# 4. 滚动更新,零停机
“`
**滚动更新配置**:
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: webapp-deployment
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # 最多多1个Pod
maxUnavailable: 0 # 零停机
# … 其余配置
“`
**效果**:
– 零停机部署
– 自动回滚(失败时)
– 可控的更新速度
—
#### 3. Service – Pod的负载均衡器
**比喻**:Service就像是一个”前台接待”,把请求分发给后面的”工作人员”(Pod)
**真实案例**:
“`yaml
apiVersion: v1
kind: Service
metadata:
name: webapp-service
spec:
selector:
app: webapp
ports:
– protocol: TCP
port: 80 # Service端口
targetPort: 3000 # Pod端口
type: ClusterIP
“`
**Service解决的问题**:
**问题1:Pod的IP会变化**
“`bash
# Pod创建后
kubectl get pods -o wide
# NAME IP
# webapp-pod-1 10.244.1.5
# webapp-pod-2 10.244.2.6
# webapp-pod-3 10.244.3.7
# Pod重启后,IP变了
# NAME IP
# webapp-pod-1 10.244.1.8 # IP变了!
“`
**解决方案**:Service提供稳定的IP
“`bash
# Service IP不变
kubectl get svc
# NAME IP
# webapp-service 10.96.0.1 # 只要Service不删除就保持不变
“`
**问题2:负载均衡**
“`yaml
# Service自动做负载均衡
apiVersion: v1
kind: Service
metadata:
name: webapp-service
spec:
selector:
app: webapp
ports:
– port: 80
targetPort: 3000
# 默认使用轮询算法
# 请求自动分发到3个Pod
“`
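想在集群里快速验证Service的效果,可以起一个临时Pod,直接用Service名称访问(依赖集群内DNS;busybox镜像只是示例):
```bash
# 临时Pod通过Service名称访问,请求会被转发到后端某个Pod
kubectl run curl-test --image=busybox -it --rm --restart=Never -- \
  wget -qO- http://webapp-service/
# 查看Service当前转发到哪些Pod
kubectl get endpoints webapp-service
```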
—
#### 4. Ingress – HTTP/S路由
**比喻**:Ingress就像是一个”智能路由器”,根据域名或路径转发请求
**真实案例**:
“`yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: webapp-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
cert-manager.io/cluster-issuer: “letsencrypt-prod”
spec:
ingressClassName: nginx
rules:
# 域名1:api.your-domain.com
– host: api.your-domain.com
http:
paths:
– path: /v1
pathType: Prefix
backend:
service:
name: api-v1-service
port:
number: 80
# 域名2:web.your-domain.com
– host: web.your-domain.com
http:
paths:
– path: /
pathType: Prefix
backend:
service:
name: webapp-service
port:
number: 80
tls:
– hosts:
– api.your-domain.com
– web.your-domain.com
secretName: tls-cert
“`
**Ingress的价值**:
**场景**:有50个服务,需要对外暴露
**没有Ingress**:
– 每个服务一个LoadBalancer
– 需要50个公网IP
– 成本:50 × ¥200/月 = ¥10,000/月
**有Ingress**:
– 只需要1个Ingress Controller
– 只需要1个公网IP
– 成本:¥200/月
– **节省98%**
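下面是一个粗略的验证思路:先装好Ingress Controller,再用Host头模拟不同域名的请求(这里以Helm安装ingress-nginx为例,命令仅供参考,也可以用官方YAML安装):
```bash
# 安装ingress-nginx控制器
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace
# 查到入口Service的公网IP
kubectl get svc -n ingress-nginx ingress-nginx-controller
# 用Host头验证路由(把EXTERNAL_IP换成上面查到的地址)
curl -H "Host: web.your-domain.com" http://EXTERNAL_IP/
curl -H "Host: api.your-domain.com" http://EXTERNAL_IP/v1/
```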
—
## 第二部分:我们遇到的10个坑
### 坑1:资源限制没设置,Pod吃光内存
**场景**:某个Pod内存泄漏,把整个节点内存吃光
**现象**:
“`bash
# 节点内存耗尽
kubectl top nodes
# NAME CPU MEMORY
# node1 80% 95%
# node2 60% 85%
# 所有Pod都变慢了
kubectl get pods
# NAME READY STATUS
# webapp-pod-1 1/1 Running
# webapp-pod-2 0/1 OOMKilled # 被杀了
# webapp-pod-3 1/1 Running
“`
**根本原因**:
– 没有设置资源限制(limits)
– Pod可以无限制使用内存
– 一个Pod搞垮整个节点
**解决方案**:
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
resources:
# 请求资源(保证)
requests:
memory: “256Mi”
cpu: “250m”
# 限制资源(上限)
limits:
memory: “512Mi”
cpu: “500m”
“`
**效果**:
– Pod最多使用512MB内存
– 超过后被OOMKilled
– 不影响其他Pod
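排查这类问题时,可以先确认Pod确实是被OOMKilled,再对比实际用量和limits(kubectl top 依赖 metrics-server):
```bash
# Last State 里会给出 OOMKilled 的退出原因和退出码(137)
kubectl describe pod webapp-pod-2 | grep -A 5 'Last State'
# 查看实际资源用量,和requests/limits做对比
kubectl top pods
kubectl top nodes
```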
—
### 坑2:就绪探针配置错误,流量提前进入
**场景**:应用启动需要30秒,但K8s认为10秒就就绪了
**现象**:
“`bash
# Pod状态是Running,但实际还没启动完成
kubectl get pods
# NAME READY STATUS RESTARTS
# webapp-pod 1/1 Running 0
# 但访问时返回502
curl http://webapp-service
# 502 Bad Gateway
“`
**根本原因**:
– 没有配置readinessProbe
– K8s认为Pod启动成功就加入Service
– 实际应用还在初始化
**解决方案**:
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
# 就绪探针
readinessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30 # 30秒后开始检查
periodSeconds: 10 # 每10秒检查一次
timeoutSeconds: 5 # 超时时间5秒
successThreshold: 1 # 成功1次就绪
failureThreshold: 3 # 失败3次未就绪
# 存活探针
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
“`
**效果**:
– Pod真正就绪后才接收流量
– 应用崩溃时自动重启
– 不再出现流量打到尚未初始化完成的Pod而返回502的情况
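验证就绪探针是否生效,可以观察Service的Endpoints和Pod事件(下面的命令沿用上文的示例名称):
```bash
# 未通过就绪探针的Pod不会出现在Service的Endpoints里
kubectl get endpoints webapp-service
# 查看探针配置和失败原因
kubectl describe pod webapp-pod | grep -i -A 3 'readiness'
# 探针持续失败时,事件里通常能看到 "Readiness probe failed"
kubectl get events --field-selector involvedObject.name=webapp-pod
```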
—
### 坑3:优雅停机没配置,请求丢失
**场景**:部署新版本时,正在处理的请求被中断
**现象**:
“`bash
# 更新Deployment
kubectl set image deployment/webapp webapp=myapp:2.0.0
# 旧Pod立即被删除
# 正在处理的请求被中断
# 用户看到502错误
“`
**根本原因**:
– 应用没有处理SIGTERM信号,进程收到信号后直接退出
– 没有配置preStop钩子,Pod还没从Service的Endpoints摘除就开始停止进程
– 默认的terminationGracePeriodSeconds只有30秒,长请求来不及处理完就被SIGKILL
– 结果就是正在处理的请求被中断
**解决方案**:
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
terminationGracePeriodSeconds: 60 # 给60秒优雅停机时间
containers:
– name: webapp
image: myapp:1.0.0
# 生命周期钩子
lifecycle:
preStop:
exec:
command:
– sh
– -c
– "sleep 15" # 先等15秒让Service把该Pod从Endpoints摘除,之后kubelet才发送SIGTERM
# 探针调整
readinessProbe:
httpGet:
path: /health
port: 3000
failureThreshold: 10 # 10次失败才认为未就绪
“`
**应用代码处理SIGTERM**:
“`javascript
// Node.js示例
process.on('SIGTERM', () => {
  console.log('Received SIGTERM, shutting down gracefully...');
  // 1. 停止接受新请求
  server.close(() => {
    console.log('All connections closed');
    process.exit(0);
  });
  // 2. 60秒后强制退出
  setTimeout(() => {
    console.log('Forcing exit...');
    process.exit(1);
  }, 60000);
});
“`
**效果**:
– 正在处理的请求完成后才退出
– 零请求丢失
– 用户无感知
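一个简单的验证方法:在集群内持续发请求,同时触发滚动更新,看是否有失败的请求(Service名称和/health路径沿用上文示例):
```bash
# 临时Pod持续请求Service,观察滚动更新期间有没有失败
kubectl run loadtest --image=busybox -it --rm --restart=Never -- \
  sh -c 'while true; do wget -q -O /dev/null http://webapp-service/health && echo ok || echo fail; sleep 0.2; done'
# 另开一个终端触发滚动更新
kubectl set image deployment/webapp-deployment webapp=myapp:2.0.0
kubectl rollout status deployment/webapp-deployment
```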
—
### 坑4:镜像拉取失败,Pod启动超时
**场景**:镜像很大(2GB),拉取超时
**现象**:
“`bash
# Pod状态一直是ImagePullBackOff
kubectl get pods
# NAME READY STATUS RESTARTS
# webapp-pod 0/1 ImagePullBackOff 0
# bigdata-pod 0/1 ErrImagePull 0
# 查看事件
kubectl describe pod webapp-pod
# Events:
# Failed to pull image “myapp:1.0.0”: rpc error: code = Unknown
# desc = context deadline exceeded
“`
**根本原因**:
– 镜像太大(2GB)
– 网络带宽有限
– 默认拉取超时时间不够
**解决方案**:
**方案1:优化镜像大小**
“`dockerfile
# 优化前(2GB)
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["node", "server.js"]
# 优化后(150MB):多阶段构建 + alpine基础镜像
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
RUN npm run build
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "server.js"]
“`
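可以在本地构建两个版本对比镜像体积(Dockerfile.old 指优化前的版本,文件名只是示例):
```bash
# 分别构建优化前后的镜像
docker build -t myapp:fat -f Dockerfile.old .
docker build -t myapp:slim -f Dockerfile .
# 对比镜像大小
docker images | grep myapp
```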
**方案2:使用镜像缓存**
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
# 节点上已有镜像时直接使用,不再重复拉取
imagePullPolicy: IfNotPresent
“`
“`bash
# 预拉取镜像到所有节点
docker pull myapp:1.0.0
# 或使用ImagePullSecrets(私有仓库)
kubectl create secret docker-registry regcred \
  --docker-server=registry.your-domain.com \
  --docker-username=user \
  --docker-password=password
# 在Pod中使用
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
imagePullSecrets:
– name: regcred
containers:
– name: webapp
image: registry.your-domain.com/myapp:1.0.0
“`
—
### 坑5:配置文件硬编码,更新困难
**场景**:数据库连接信息硬编码在代码中
**现象**:
“`javascript
// 代码中硬编码配置
const dbConfig = {
  host: 'mysql.prod',
  port: 3306,
  user: 'root',
  password: 'password123' // 危险!
};
// 更新配置需要重新构建镜像
// 容易泄露密码
“`
**解决方案:ConfigMap和Secret**
**ConfigMap(非敏感配置)**:
“`yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
APP_ENV: “production”
APP_PORT: “3000”
LOG_LEVEL: “info”
—
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
envFrom:
– configMapRef:
name: app-config
“`
**Secret(敏感配置)**:
“`yaml
apiVersion: v1
kind: Secret
metadata:
name: db-secret
type: Opaque
data:
password: cGFzc3dvcmQxMjM= # base64编码
—
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
env:
– name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-secret
key: password
“`
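Secret的data字段要求base64编码,手动编码容易出错,更常见的做法是让kubectl代劳(以下命令仅作演示,生产环境不要把密码明文留在shell历史里):
```bash
# 手动编码
echo -n 'password123' | base64
# 输出:cGFzc3dvcmQxMjM=
# 让kubectl自动完成编码
kubectl create secret generic db-secret --from-literal=password=password123
# 校验(取出来需要自己解码)
kubectl get secret db-secret -o jsonpath='{.data.password}' | base64 -d
```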
**挂载配置文件**:
“`yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
config.yml: |
database:
host: mysql.prod
port: 3306
redis:
host: redis.prod
port: 6379
—
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
volumeMounts:
– name: config-volume
mountPath: /etc/config
readOnly: true
volumes:
– name: config-volume
configMap:
name: app-config
“`
**效果**:
– 配置与代码分离
– 更新配置无需重建镜像
– 敏感信息加密存储
—
### 坑6:日志没有收集,故障排查困难
**场景**:Pod崩溃,但找不到日志
**现象**:
“`bash
# Pod重启了
kubectl get pods
# NAME READY STATUS RESTARTS
# webapp-pod 0/1 Running 5 # 重启了5次
# 但查看日志是空的
kubectl logs webapp-pod
# (空)
“`
**根本原因**:
– Pod重启后,容器内日志丢失
– 没有配置日志持久化
– 没有集中式日志收集
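遇到这种情况,先别急着翻节点,很多时候上一次崩溃前的日志还能直接取回来:
```bash
# 查看上一次崩溃的容器日志(当前日志为空时特别有用)
kubectl logs webapp-pod --previous
# 多容器Pod需要用 -c 指定容器
kubectl logs webapp-pod -c webapp --previous
# 查看上一次退出的原因和退出码
kubectl describe pod webapp-pod | grep -A 5 'Last State'
```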
**解决方案**:
**方案1:日志持久化**
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
volumeMounts:
– name: log-volume
mountPath: /var/log/app
volumes:
– name: log-volume
hostPath:
path: /var/log/app/webapp-pod
type: DirectoryOrCreate
“`
**方案2:Sidecar日志收集**
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
# 主容器:应用
– name: webapp
image: myapp:1.0.0
volumeMounts:
– name: log-volume
mountPath: /var/log/app
# Sidecar:日志收集
– name: log-collector
image: fluentd:v1.14
volumeMounts:
– name: log-volume
mountPath: /var/log/app
readOnly: true
# 注意:fluentd 的 -c 参数需要的是配置文件路径,不能直接内联配置内容,
# 实际使用时把下面这段采集配置放进 ConfigMap,挂载为 /fluentd/etc/fluent.conf:
#   <source>
#     @type tail
#     path /var/log/app/*.log
#     pos_file /var/log/fluentd-containers.log.pos
#     tag kubernetes.*
#     read_from_head true
#     <parse>
#       @type json
#     </parse>
#   </source>
volumes:
– name: log-volume
emptyDir: {}
“`
**方案3:集中式日志(ELK)**
“`yaml
# 部署Elasticsearch
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
spec:
serviceName: elasticsearch
replicas: 3
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
containers:
– name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
ports:
– containerPort: 9200
env:
– name: discovery.type
value: single-node
resources:
requests:
memory: “2Gi”
limits:
memory: “4Gi”
—
# 部署Kibana
apiVersion: apps/v1
kind: Deployment
metadata:
name: kibana
spec:
replicas: 1
selector:
matchLabels:
app: kibana
template:
metadata:
labels:
app: kibana
spec:
containers:
– name: kibana
image: docker.elastic.co/kibana/kibana:8.0.0
ports:
– containerPort: 5601
“`
—
### 坑7:监控和告警缺失,故障发现慢
**场景**:服务挂了2小时才发现
**根本原因**:
– 没有配置监控
– 没有设置告警
– 靠用户投诉才知道故障
**解决方案:Prometheus + Grafana**
**部署Prometheus**:
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
– name: prometheus
image: prom/prometheus:v2.45.0
ports:
– containerPort: 9090
args:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--web.console.libraries=/etc/prometheus/console_libraries'
  - '--web.console.templates=/etc/prometheus/consoles'
  - '--storage.tsdb.retention.time=200h'
  - '--web.enable-lifecycle'
volumeMounts:
– name: prometheus-config
mountPath: /etc/prometheus
– name: prometheus-storage
mountPath: /prometheus
volumes:
– name: prometheus-config
configMap:
name: prometheus-config
– name: prometheus-storage
emptyDir: {}
—
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
– job_name: ‘kubernetes-pods’
kubernetes_sd_configs:
– role: pod
relabel_configs:
– source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
– source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
– source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
“`
**部署Grafana**:
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
– name: grafana
image: grafana/grafana:10.0.0
ports:
– containerPort: 3000
env:
– name: GF_SECURITY_ADMIN_PASSWORD
value: admin123
volumeMounts:
– name: grafana-storage
mountPath: /var/lib/grafana
volumes:
– name: grafana-storage
emptyDir: {}
—
apiVersion: v1
kind: Service
metadata:
name: grafana
spec:
selector:
app: grafana
ports:
– port: 3000
targetPort: 3000
type: LoadBalancer
“`
**配置告警(Alertmanager)**:
“`yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
data:
alerts.yml: |
groups:
– name: alert_rules
interval: 30s
rules:
# Pod告警
– alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: “Pod {{ $labels.pod }} is crash looping”
# 节点告警
– alert: NodeMemoryPressure
expr: kube_node_status_condition{condition=”MemoryPressure”, status=”true”} == 1
for: 10m
labels:
severity: warning
annotations:
summary: “Node {{ $labels.node }} has memory pressure”
# 服务告警
– alert: ServiceDown
expr: up{job=”kubernetes-pods”} == 0
for: 5m
labels:
severity: critical
annotations:
summary: “Service {{ $labels.service }} is down”
“`
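部署完成后,可以先用端口转发在本地确认Prometheus和Grafana是否正常,并用promtool校验告警规则的语法(promtool随Prometheus发行包提供,alerts.yml即上面ConfigMap里的规则文件):
```bash
# 端口转发到本地,无需暴露公网
kubectl port-forward deploy/prometheus 9090:9090
kubectl port-forward deploy/grafana 3000:3000
# 校验告警规则语法
promtool check rules alerts.yml
```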
—
### 坑8:HPA配置不当,自动扩缩容失效
**场景**:流量激增,但Pod没有自动扩容
**现象**:
“`bash
# 流量增加10倍
# 但Pod数量还是3个
kubectl get pods
# NAME READY STATUS
# webapp-pod-1 1/1 Running
# webapp-pod-2 1/1 Running
# webapp-pod-3 1/1 Running
# 所有Pod都过载,响应缓慢
“`
**根本原因**:
– 没有配置Horizontal Pod Autoscaler(HPA)
– 或者配置的metrics不正确
**解决方案**:
“`yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: webapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: webapp-deployment
# 最小/最大副本数
minReplicas: 3
maxReplicas: 10
# 目标指标
metrics:
# CPU指标
– type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# 内存指标
– type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# 自定义指标(QPS)
– type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 1000
# 行为配置
behavior:
scaleUp:
# 快速扩容
stabilizationWindowSeconds: 0
policies:
– type: Percent
value: 100 # 每次最多翻倍
periodSeconds: 15
– type: Pods
value: 4 # 每次最多增加4个Pod
periodSeconds: 15
selectPolicy: Max
scaleDown:
# 缓慢缩容
stabilizationWindowSeconds: 300 # 5分钟稳定期
policies:
– type: Percent
value: 10 # 每次最多减少10%
periodSeconds: 60
“`
**效果**:
– CPU超过70%时自动扩容
– 内存超过80%时自动扩容
– QPS超过1000时自动扩容
– 流量下降后自动缩容
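两点提醒:HPA的CPU/内存指标依赖metrics-server,自定义指标(如QPS)还需要额外部署指标适配器(例如prometheus-adapter),否则对应的metric不会生效。日常也可以用一条命令快速创建基于CPU的HPA,并实时观察效果:
```bash
# 不写YAML,快速创建基于CPU的HPA(名字默认与Deployment相同)
kubectl autoscale deployment webapp-deployment --cpu-percent=70 --min=3 --max=10
# 实时观察扩缩容决策和副本数变化
kubectl get hpa -w
kubectl get pods -l app=webapp -w
```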
—
### 坑9:网络策略没配置,安全风险高
**场景**:任何Pod都可以访问任何服务
**安全问题**:
– 前端Pod可以直接访问数据库
– 测试环境可以访问生产环境
– 攻击者攻破一个Pod后可以横向移动
**解决方案:NetworkPolicy**
“`yaml
# 默认拒绝所有流量
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
spec:
podSelector: {}
policyTypes:
– Ingress
– Egress
—
# 只允许webapp访问backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: webapp-to-backend
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
– Ingress
ingress:
– from:
– podSelector:
matchLabels:
app: webapp
ports:
– protocol: TCP
port: 3000
—
# 只允许backend访问数据库
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-to-db
spec:
podSelector:
matchLabels:
app: mysql
policyTypes:
– Ingress
ingress:
– from:
– podSelector:
matchLabels:
app: backend
ports:
– protocol: TCP
port: 3306
—
# 允许DNS查询
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
spec:
podSelector: {}
policyTypes:
– Egress
egress:
– to:
– namespaceSelector:
matchLabels:
name: kube-system
ports:
– protocol: UDP
port: 53
“`
**效果**:
– 前端无法直接访问数据库
– 只允许必要的通信
– 最小权限原则
– 注意:default-deny同时拒绝了Egress,backend访问数据库等出站流量也要再配一条Egress规则放行
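验证NetworkPolicy是否生效,可以用两个带不同标签的临时Pod去访问backend(backend-service和端口只是沿用上文示例的假设;前提是CNI插件支持NetworkPolicy,例如Calico、Cilium,且Egress已按上面的注意事项放行):
```bash
# 带app=webapp标签的临时Pod访问backend,预期能通
kubectl run np-allowed --image=busybox -it --rm --restart=Never \
  --labels="app=webapp" -- wget -qO- -T 3 http://backend-service:3000/
# 不带标签的临时Pod访问backend,预期超时被拒
kubectl run np-denied --image=busybox -it --rm --restart=Never -- \
  wget -qO- -T 3 http://backend-service:3000/
```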
—
### 坑10:节点故障没处理,服务中断
**场景**:物理节点故障,所有Pod丢失
**现象**:
“`bash
# 节点1故障
kubectl get nodes
# NAME STATUS ROLES
# node1 NotReady # 故障
# node2 Ready
# node3 Ready
# node1上的Pod全部丢失
kubectl get pods -o wide
# NAME READY STATUS NODE
# webapp-pod-1 0/1 Unknown node1
# webapp-pod-2 1/1 Running node2
# webapp-pod-3 1/1 Running node3
“`
**解决方案:PodDisruptionBudget + 反亲和性**
**PodDisruptionBudget(PDB)**:
“`yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: webapp-pdb
spec:
minAvailable: 2 # 至少保持2个可用
selector:
matchLabels:
app: webapp
“`
**Pod反亲和性**:
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: webapp-deployment
spec:
replicas: 3
selector:
matchLabels:
app: webapp
template:
metadata:
labels:
app: webapp
spec:
affinity:
# Pod反亲和性
podAntiAffinity:
# 软反亲和(尽力而为)
preferredDuringSchedulingIgnoredDuringExecution:
– weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: webapp
topologyKey: kubernetes.io/hostname
# 硬反亲和(必须满足):与上面的软反亲和二选一即可,节点数不足时硬反亲和会导致Pod无法调度
requiredDuringSchedulingIgnoredDuringExecution:
– labelSelector:
matchLabels:
app: webapp
topologyKey: kubernetes.io/hostname
containers:
– name: webapp
image: myapp:1.0.0
“`
**效果**:
– 3个Pod分散在3个节点
– 任何1个节点故障,服务仍可用
– 满足高可用要求
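可以用kubectl drain模拟节点维护,观察PDB和反亲和性是否按预期工作(node1为示例节点名):
```bash
# 确认副本分散在不同节点上
kubectl get pods -l app=webapp -o wide
# 模拟节点维护:驱逐node1上的Pod,PDB会保证任意时刻至少2个副本可用
kubectl drain node1 --ignore-daemonsets --delete-emptydir-data
# 维护结束后恢复调度
kubectl uncordon node1
```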
—
## 第三部分:K8s生产环境最佳实践
### 1. 命名空间隔离
“`yaml
# 开发环境
apiVersion: v1
kind: Namespace
metadata:
name: dev
—
# 测试环境
apiVersion: v1
kind: Namespace
metadata:
name: test
—
# 生产环境
apiVersion: v1
kind: Namespace
metadata:
name: prod
—
# ResourceQuota(资源配额)
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
namespace: dev
spec:
hard:
requests.cpu: “10”
requests.memory: 20Gi
limits.cpu: “20”
limits.memory: 40Gi
“`
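配额是否生效、用了多少,可以随时查看:
```bash
# 查看dev命名空间的配额和当前用量
kubectl get resourcequota -n dev
kubectl describe resourcequota compute-resources -n dev
# 部署时指定命名空间,避免误占其他环境的资源
kubectl apply -f webapp-deployment.yaml -n dev
```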
### 2. 优先级类
“`yaml
# 高优先级(生产环境)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000
globalDefault: false
description: “高优先级类”
—
# 中优先级(测试环境)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: medium-priority
value: 500
globalDefault: true
description: “中优先级类”
—
# 低优先级(开发环境)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low-priority
value: 100
globalDefault: false
description: “低优先级类”
“`
### 3. 滚动更新策略
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: webapp-deployment
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2 # 最多多2个Pod(12个同时运行)
maxUnavailable: 0 # 零停机
revisionHistoryLimit: 10 # 保留10个历史版本
# … 其余配置
“`
### 4. 健康检查
“`yaml
apiVersion: v1
kind: Pod
metadata:
name: webapp-pod
spec:
containers:
– name: webapp
image: myapp:1.0.0
# 存活探针(健康检查)
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
# 就绪探针
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 30
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
# 启动探针
startupProbe:
httpGet:
path: /startup
port: 3000
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 30
“`
### 5. 资源限制
“`yaml
apiVersion: v1
kind: LimitRange
metadata:
name: resource-limits
namespace: default
spec:
limits:
– default:
cpu: 500m
memory: 512Mi
defaultRequest:
cpu: 250m
memory: 256Mi
type: Container
“`
—
## 第四部分:K8s实战案例
### 案例1:零停机部署
**需求**:更新应用版本,用户无感知
**实现**:
“`yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: webapp-deployment
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 0
minReadySeconds: 5
revisionHistoryLimit: 10
selector:
matchLabels:
app: webapp
template:
metadata:
labels:
app: webapp
spec:
containers:
– name: webapp
image: myapp:2.0.0
ports:
– containerPort: 3000
resources:
requests:
memory: “256Mi”
cpu: “250m”
limits:
memory: “512Mi”
cpu: “500m”
lifecycle:
preStop:
exec:
command:
– sh
– -c
– "sleep 15" # 先等待Service摘除流量,再由应用自身处理SIGTERM优雅退出(见坑3的代码示例)
readinessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 5
failureThreshold: 10
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 3
“`
**更新流程**:
“`bash
# 1. 更新镜像
kubectl set image deployment/webapp-deployment webapp=myapp:2.0.0
# 2. 查看滚动更新状态
kubectl rollout status deployment/webapp-deployment
# 3. 如果有问题,立即回滚
kubectl rollout undo deployment/webapp-deployment
# 4. 查看历史版本
kubectl rollout history deployment/webapp-deployment
# 5. 回滚到指定版本
kubectl rollout undo deployment/webapp-deployment --to-revision=3
“`
—
### 案例2:自动扩缩容
**需求**:流量高峰期自动扩容
**实现**:
“`yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: webapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: webapp-deployment
minReplicas: 3
maxReplicas: 20
metrics:
– type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
– type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
– type: Percent
value: 100
periodSeconds: 15
– type: Pods
value: 5
periodSeconds: 15
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
– type: Percent
value: 10
periodSeconds: 60
“`
**测试**:
“`bash
# 压力测试
kubectl run -it --rm stress --image=stress --restart=Never -- \
  stress --cpu 4 --timeout 300s
# 查看扩容情况
kubectl get hpa
kubectl get pods -w
“`
—
## 第五部分:K8s学习路径
### 第1个月:基础入门
**学习内容**:
– K8s核心概念
– kubectl基础命令
– Pod、Deployment、Service
– 简单应用部署
**实战项目**:
– 部署Nginx
– 部署WordPress
– 配置Service和Ingress
—
### 第2-3个月:深入实践
**学习内容**:
– ConfigMap和Secret
– 持久化存储(PV/PVC)
– 监控和日志
– 滚动更新和回滚
**实战项目**:
– 部署有状态应用(MySQL)
– 配置CI/CD
– 实现自动扩缩容
—
### 第4-6个月:高级应用
**学习内容**:
– 网络策略
– 资源配额
– 优先级类
– 自定义资源(CRD)
– Operator开发
**实战项目**:
– 多租户集群
– 微服务治理
– 服务网格(Istio)
—
## 结语:K8s改变了我们的运维方式
### 2年总结
从K8s小白到K8s专家,这2年我经历了:
**技能提升**:
– ✅ 掌握50+个kubectl命令
– ✅ 管理100+个Pod
– ✅ 处理10+种生产故障
– ✅ 实现自动化运维
**效率提升**:
– 部署时间:2小时 → 2分钟(-98%)
– 故障恢复:1小时 → 1分钟(-98%)
– 资源利用率:20% → 70%(+250%)
– 运维人力:5人 → 2人(-60%)
**心态变化**:
– 之前:害怕自动化,担心失控
– 现在:拥抱自动化,享受效率
—
## 互动环节
### 你是否也遇到过这些问题?
– 你在使用K8s时遇到过什么困难?
– 你有什么独特的K8s技巧?
– 你想了解哪方面的深入内容?
**欢迎在评论区分享**你的故事和经验!
—
**相关文章**:
– [Docker生产环境实战](/docker-production-guide)
– [微服务架构设计](/microservices-architecture)
– [DevOps最佳实践](/devops-best-practices)
**推荐资源**:
– [K8s官方文档](https://kubernetes.io/docs/)
– [K8s实战课程](https://www.kubernetesacademy.com/)
– [Awesome K8s](https://github.com/ramitsurana/awesome-kubernetes)
—
**作者简介**:陈存利,全栈开发者,K8s认证管理员(CKA),2年K8s生产环境经验,现就职于一个典型的技术团队。热爱分享,运营技术博客。
—
**版权声明**:本文原创,转载请注明出处。
—