# Kubernetes in Production: 10 Pitfalls We Hit and How We Solved Them

> **Hard-won lessons from moving single-server deployments to a K8s cluster**
>
> Part 2 of a 5-part cloud-native series
## Introduction: Do These Scenarios Sound Familiar?

Have you ever lived through situations like these?

**Scenario 1: a service dies, and you restart it by hand**
- An alert fires at 3 a.m.: the service has crashed
- You scramble out of bed and SSH into the server
- You restart the service manually
- At the next morning's postmortem: why didn't anything restart it automatically?

**Scenario 2: scrambling to scale out**
- A big promotion hits and traffic spikes 10x
- You add servers by hand, under pressure
- You reconfigure the load balancer
- Despite the scramble, requests still time out…

**Scenario 3: inconsistent environments**
- Works in dev, fails in test
- Works in test, crashes in production
- Every investigation starts by diffing environments
- Enormous amounts of time wasted…

**Scenario 4: rollback nightmares**
- A new release ships with a severe bug
- You roll back urgently, switching versions by hand
- The database migration fails
- The service is down for an hour…

**Scenario 5: low resource utilization**
- Each server runs at only 20% load
- But you don't dare co-locate more services
- You're afraid of resource contention
- Server costs stay stubbornly high…

I lived through all of this myself.

Then, in 2022, we decided to go all in on Kubernetes (K8s).

## My Story: Running K8s in Production for the First Time

**When**: January 2022
**Where**: a startup team in Beijing
**Background**: the business was growing fast and needed to support 1M+ daily active users

**The challenge**:
- 50+ microservices
- 1B+ requests per day
- High availability and scalability required
- An ops team of only 5 people

**Problems with the traditional approaches**:

**Approach 1: manual deployment**

```bash
# Deploy each service to 10 servers by hand
ssh server1 "systemctl restart service-a"
ssh server2 "systemctl restart service-a"
# … repeat 50 times
```

**Problems**:
- A single deployment took 2 hours
- Error-prone (missed servers, misconfiguration)
- No fast rollback

**Approach 2: Ansible automation**

```yaml
# playbook.yml
- hosts: servers
  tasks:
    - name: Restart service
      systemd:
        name: "{{ service_name }}"
        state: restarted
```

**Problems**:
- You still manage servers
- No automatic scaling
- No self-healing

**Our K8s journey**

**Month 1: learning**
- Team members attended K8s training
- Built a K8s cluster in the test environment
- Migrated 5 non-critical services

**Month 3: pilot**
- Migrated 20 core services
- Hit all kinds of problems (detailed below)
- Gradually built out monitoring and alerting

**Month 6: full migration**
- All 50 services running on K8s
- Operations fully automated
- Ops team efficiency up 10x

**Before and after**:

| Metric | Traditional | K8s | Change |
|--------|-------------|-----|--------|
| Deployment time | 2 hours | 2 minutes | -98% |
| Rollback time | 30 minutes | 30 seconds | -98% |
| Failure recovery | manual (1 hour) | automatic (1 minute) | -98% |
| Resource utilization | 20% | 70% | +250% |
| Ops headcount | 5 | 2 | -60% |

## Part 1: K8s Core Concepts (Beginner-Friendly)

### The first trap I fell into: too many concepts, hard to keep straight

**My early misconceptions**:
- "Isn't a Pod just a container?"
- "What's the difference between a Service and a Deployment?"
- "And what is an Ingress for?"

**The reality**:

#### 1. Pod: the smallest deployable unit in K8s

**Analogy**: a Pod is like a pea pod that holds one or more peas (containers)

**Example**:

```yaml
# Single-container Pod
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
    - name: nginx
      image: nginx:1.25
      ports:
        - containerPort: 80
```

**Multi-container Pod (sidecar pattern)**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    # Main container: the application
    - name: webapp
      image: myapp:1.0.0
      ports:
        - containerPort: 3000

    # Sidecar container: log collection
    - name: log-collector
      image: fluent/fluentd:v1.14-1
      volumeMounts:
        - name: log-volume
          mountPath: /var/log

    # Sidecar container: metrics
    - name: monitoring
      image: prom/node-exporter:v1.3.1

  volumes:
    - name: log-volume
      emptyDir: {}
```

**Why multi-container Pods?**

**Case 1: log collection**
- App container: produces logs
- Sidecar container: ships logs to ELK in real time
- Benefit: no application code changes needed

**Case 2: monitoring agent**
- App container: runs the app
- Sidecar container: collects metrics
- Benefit: monitoring comes for free, no per-app setup

#### 2. Deployment: managing the Pod lifecycle

**Analogy**: a Deployment is like a manager supervising a crew of workers (Pods)

**Example**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  # Number of replicas
  replicas: 3

  # Selector
  selector:
    matchLabels:
      app: webapp

  # Pod template
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: myapp:1.0.0
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```

**The magic of Deployments**:

**Scenario**: I need to roll out a new application version

**The old way**:

```bash
# Delete each Pod by hand
kubectl delete pod webapp-pod-1
kubectl delete pod webapp-pod-2
kubectl delete pod webapp-pod-3
# Wait for new Pods to come up…
```

**The Deployment way**:

```bash
# Just update the image
kubectl set image deployment/webapp-deployment webapp=myapp:2.0.0

# The Deployment then automatically:
# 1. Creates new Pods (version 2.0)
# 2. Waits for them to become ready
# 3. Deletes the old Pods (version 1.0)
# 4. Rolls forward with zero downtime
```

**Rolling update configuration**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # at most 1 extra Pod during the rollout
      maxUnavailable: 0   # zero downtime

  # … rest of the spec
```

**Result**:
- Zero-downtime deployments
- One-command rollback when a rollout goes wrong
- Controllable rollout speed

#### 3. Service: a load balancer for Pods

**Analogy**: a Service is like a receptionist who hands incoming requests to the staff (Pods) behind the desk

**Example**:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp

  ports:
    - protocol: TCP
      port: 80          # Service port
      targetPort: 3000  # Pod port

  type: ClusterIP
```

**Problems a Service solves**:

**Problem 1: Pod IPs change**

```bash
# After the Pods are created
kubectl get pods -o wide

# NAME          IP
# webapp-pod-1  10.244.1.5
# webapp-pod-2  10.244.2.6
# webapp-pod-3  10.244.3.7

# After a Pod restarts, its IP changes
# NAME          IP
# webapp-pod-1  10.244.1.8   # new IP!
```

**Solution**: the Service provides a stable virtual IP, plus a stable DNS name (e.g. `webapp-service.default.svc.cluster.local`)

```bash
# The Service IP stays put
kubectl get svc

# NAME             IP
# webapp-service   10.96.0.1   # stable for the Service's lifetime
```

**Problem 2: load balancing**

```yaml
# The Service load-balances automatically
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp

  ports:
    - port: 80
      targetPort: 3000

# kube-proxy spreads requests across the 3 Pods
# (effectively round-robin, depending on the proxy mode)
```

#### 4. Ingress: HTTP/S routing

**Analogy**: an Ingress is like a smart router that forwards requests by hostname or path

**Example**:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx

  rules:
    # Host 1: api.your-domain.com
    - host: api.your-domain.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: api-v1-service
                port:
                  number: 80

    # Host 2: web.your-domain.com
    - host: web.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 80

  tls:
    - hosts:
        - api.your-domain.com
        - web.your-domain.com
      secretName: tls-cert
```

**Why Ingress matters**:

**Scenario**: you have 50 services to expose publicly

**Without Ingress**:
- One LoadBalancer per service
- 50 public IPs
- Cost: 50 × ¥200/month = ¥10,000/month

**With Ingress**:
- A single Ingress Controller
- A single public IP
- Cost: ¥200/month
- **98% cheaper**

## Part 2: The 10 Pitfalls We Hit

### Pitfall 1: no resource limits, one Pod eats all the memory

**What happened**: a Pod with a memory leak consumed the entire node's memory

**Symptoms**:

```bash
# Node memory exhausted
kubectl top nodes

# NAME   CPU   MEMORY
# node1  80%   95%
# node2  60%   85%

# Every Pod slowed down
kubectl get pods

# NAME          READY  STATUS
# webapp-pod-1  1/1    Running
# webapp-pod-2  0/1    OOMKilled   # killed by the kernel
# webapp-pod-3  1/1    Running
```

**Root cause**:
- No resource limits were set
- The Pod could use unbounded memory
- A single Pod could take down the whole node

**Fix**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0
      resources:
        # Requested resources (guaranteed)
        requests:
          memory: "256Mi"
          cpu: "250m"

        # Resource limits (hard ceiling)
        limits:
          memory: "512Mi"
          cpu: "500m"
```

**Result**:
- The Pod can use at most 512 MB of memory
- Beyond that, it alone gets OOMKilled
- Other Pods on the node are unaffected
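
A related detail worth knowing: the requests/limits combination also determines the Pod's QoS class, which controls eviction order when a node comes under memory pressure. A minimal sketch:

```yaml
# QoS classes are assigned automatically from requests/limits:
#   Guaranteed: every container sets requests == limits  -> evicted last
#   Burstable:  at least one request or limit is set     -> evicted in between
#   BestEffort: no requests or limits at all             -> evicted first
# Example of a Guaranteed-class resources section:
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

For the scenario above, even Burstable (requests below limits, as in the fix) is enough to keep one leaky Pod from dragging down the node.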

### Pitfall 2: misconfigured readiness probe, traffic arrives too early

**What happened**: the app needs 30 seconds to start, but K8s considered it ready after 10

**Symptoms**:

```bash
# Pod status is Running, but the app hasn't finished starting
kubectl get pods

# NAME        READY  STATUS   RESTARTS
# webapp-pod  1/1    Running  0

# Requests return 502
curl http://webapp-service

# 502 Bad Gateway
```

**Root cause**:
- No readinessProbe was configured
- K8s added the Pod to the Service as soon as the container started
- The application was still initializing

**Fix**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0

      # Readiness probe
      readinessProbe:
        httpGet:
          path: /health
          port: 3000
        initialDelaySeconds: 30   # start checking after 30s
        periodSeconds: 10         # check every 10s
        timeoutSeconds: 5         # 5s timeout per check
        successThreshold: 1       # 1 success => ready
        failureThreshold: 3       # 3 failures => not ready

      # Liveness probe
      livenessProbe:
        httpGet:
          path: /health
          port: 3000
        initialDelaySeconds: 60
        periodSeconds: 10
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 3
```

**Result**:
- The Pod receives traffic only once it is actually ready
- A hung or crashed app is restarted automatically by the liveness probe
- No more startup 502s

### Pitfall 3: no graceful shutdown, in-flight requests lost

**What happened**: during a rollout, requests that were being processed got cut off

**Symptoms**:

```bash
# Update the Deployment
kubectl set image deployment/webapp webapp=myapp:2.0.0

# Old Pods terminate right away
# In-flight requests are dropped
# Users see 502 errors
```

**Root cause**:
- The app exited immediately on SIGTERM instead of draining connections
- Removing the Pod from the Service's endpoints propagates asynchronously, so requests keep arriving for a moment after termination begins
- The default 30-second terminationGracePeriodSeconds doesn't help if the app ignores it

**Fix**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  terminationGracePeriodSeconds: 60   # allow 60s for graceful shutdown

  containers:
    - name: webapp
      image: myapp:1.0.0

      # Lifecycle hook
      lifecycle:
        preStop:
          exec:
            command:
              - sh
              - -c
              - "sleep 15"   # let endpoint removal propagate before SIGTERM is sent

      # Probe tuning
      readinessProbe:
        httpGet:
          path: /health
          port: 3000
        failureThreshold: 10   # 10 failures before marking not-ready
```

**Handle SIGTERM in the application**:

```javascript
// Node.js example
process.on('SIGTERM', () => {
  console.log('Received SIGTERM, shutting down gracefully…');

  // 1. Stop accepting new requests; finish in-flight ones
  server.close(() => {
    console.log('All connections closed');
    process.exit(0);
  });

  // 2. Force exit after 60 seconds
  setTimeout(() => {
    console.log('Forcing exit…');
    process.exit(1);
  }, 60000);
});
```

**Result**:
- In-flight requests complete before the process exits
- Zero dropped requests
- Users notice nothing
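
You can try the SIGTERM contract locally, without a cluster. Here is a minimal shell simulation (the "draining connections" message is just a stand-in for your app's real drain logic): a process traps SIGTERM, drains, and exits 0, which is exactly what the kubelet waits for during terminationGracePeriodSeconds.

```shell
# Simulate an app that shuts down gracefully on SIGTERM
sh -c 'trap "echo draining connections; exit 0" TERM; while true; do sleep 1; done' &
pid=$!

sleep 1              # give the child time to install its trap
kill -TERM "$pid"    # what the kubelet sends first
wait "$pid"          # returns the child's exit status
echo "exit code: $?" # → exit code: 0
```

If the process ignored SIGTERM instead, the kubelet would follow up with SIGKILL once the grace period expires, and in-flight work would be lost.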

### Pitfall 4: image pulls fail, Pods time out on startup

**What happened**: the image was huge (2 GB) and pulls timed out

**Symptoms**:

```bash
# Pods stuck in ImagePullBackOff
kubectl get pods

# NAME         READY  STATUS            RESTARTS
# webapp-pod   0/1    ImagePullBackOff  0
# bigdata-pod  0/1    ErrImagePull      0

# Inspect the events
kubectl describe pod webapp-pod

# Events:
#   Failed to pull image "myapp:1.0.0": rpc error: code = Unknown
#   desc = context deadline exceeded
```

**Root cause**:
- The image was too large (2 GB)
- Limited network bandwidth
- The default pull deadline was too short

**Fixes**:

**Fix 1: shrink the image**

```dockerfile
# Before (2 GB)
FROM node:18

WORKDIR /app
COPY . .
RUN npm install
RUN npm run build

CMD ["node", "server.js"]
```

```dockerfile
# After (~150 MB): multi-stage build on Alpine
FROM node:18-alpine AS builder

WORKDIR /app
COPY package*.json ./
RUN npm ci                      # full install (build tools live in devDependencies)
COPY . .
RUN npm run build

FROM node:18-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production    # production dependencies only
COPY --from=builder /app/dist ./dist

CMD ["node", "server.js"]
```

**Fix 2: use the image cache and pre-pull**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0

      # Reuse a locally cached image when one is present
      imagePullPolicy: IfNotPresent
```

```bash
# Pre-pull the image on every node
docker pull myapp:1.0.0

# For a private registry, create an imagePullSecret
kubectl create secret docker-registry regcred \
  --docker-server=registry.your-domain.com \
  --docker-username=user \
  --docker-password=password
```

Then reference the secret from the Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: webapp
      image: registry.your-domain.com/myapp:1.0.0
```

### Pitfall 5: hard-coded configuration, painful to update

**What happened**: database connection details were hard-coded in the application

**Symptoms**:

```javascript
// Configuration hard-coded in the source
const dbConfig = {
  host: 'mysql.prod',
  port: 3306,
  user: 'root',
  password: 'password123' // dangerous!
};

// Every config change requires rebuilding the image
// Credentials can easily leak via source control
```

**Fix: ConfigMap and Secret**

**ConfigMap (non-sensitive configuration)**:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_ENV: "production"
  APP_PORT: "3000"
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0
      envFrom:
        - configMapRef:
            name: app-config
```

**Secret (sensitive configuration)**:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  password: cGFzc3dvcmQxMjM=   # base64-encoded
---
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
```
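
Note that the `cGFzc3dvcmQxMjM=` value above is plain base64: an encoding, not encryption. A quick sketch of producing and checking it (the `kubectl create secret` line at the end is the usual shortcut that does the encoding for you):

```shell
# Encode a value for the `data:` field of a Secret manifest
echo -n 'password123' | base64
# → cGFzc3dvcmQxMjM=

# Decode to verify (this is why Secrets need RBAC, not just base64)
echo -n 'cGFzc3dvcmQxMjM=' | base64 -d
# → password123

# Or let kubectl do the encoding:
# kubectl create secret generic db-secret --from-literal=password='password123'
```

Because anyone with read access to the Secret can decode it, restrict access with RBAC, or use an external secret store for truly sensitive credentials.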

**Mounting configuration as files**:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.yml: |
    database:
      host: mysql.prod
      port: 3306
    redis:
      host: redis.prod
      port: 6379
---
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
  volumes:
    - name: config-volume
      configMap:
        name: app-config
```

**Result**:
- Configuration is separated from code
- Config changes no longer require an image rebuild (note: env vars from a ConfigMap need a Pod restart, while mounted files are updated in place)
- Sensitive values stay out of the codebase

### Pitfall 6: no log collection, painful troubleshooting

**What happened**: a Pod kept crashing, but the logs were nowhere to be found

**Symptoms**:

```bash
# The Pod has been restarting
kubectl get pods

# NAME        READY  STATUS   RESTARTS
# webapp-pod  0/1    Running  5        # restarted 5 times

# But the logs are empty
kubectl logs webapp-pod

# (empty; try `kubectl logs webapp-pod --previous` for the last crashed container)
```

**Root cause**:
- Logs written inside the container were lost when it restarted
- No log persistence configured
- No centralized log collection

**Fixes**:

**Fix 1: persist logs**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
  volumes:
    - name: log-volume
      hostPath:
        path: /var/log/app/webapp-pod
        type: DirectoryOrCreate
```

**Fix 2: sidecar log collection**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/app/*.log
      pos_file /tmp/fluentd.log.pos
      tag app.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>
---
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    # Main container: the application
    - name: webapp
      image: myapp:1.0.0
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app

    # Sidecar: log collection
    - name: log-collector
      image: fluent/fluentd:v1.14-1
      args: ["-c", "/fluentd/etc/fluent.conf"]
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc

  volumes:
    - name: log-volume
      emptyDir: {}
    - name: fluentd-config
      configMap:
        name: fluentd-config
```

**Fix 3: centralized logging (ELK)**

```yaml
# Deploy Elasticsearch
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 1   # single-node demo; a multi-node cluster needs discovery config
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
          ports:
            - containerPort: 9200
          env:
            - name: discovery.type
              value: single-node
          resources:
            requests:
              memory: "2Gi"
            limits:
              memory: "4Gi"
---
# Deploy Kibana
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.0.0
          ports:
            - containerPort: 5601
```

### Pitfall 7: no monitoring or alerting, slow failure detection

**What happened**: a service was down for 2 hours before anyone noticed

**Root cause**:
- No monitoring in place
- No alerts configured
- We learned about outages from user complaints

**Fix: Prometheus + Grafana**

**Deploy Prometheus**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0
          ports:
            - containerPort: 9090
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--storage.tsdb.retention.time=200h'
            - '--web.enable-lifecycle'
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage
          emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
```

**Deploy Grafana**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.0.0
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: admin123   # demo only; load this from a Secret in production
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
  type: LoadBalancer
```

**Configure alerts (Prometheus rules, routed through Alertmanager)**:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  alerts.yml: |
    groups:
      - name: alert_rules
        interval: 30s
        rules:
          # Pod alerts
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.pod }} is crash looping"

          # Node alerts
          - alert: NodeMemoryPressure
            expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Node {{ $labels.node }} has memory pressure"

          # Target alerts
          - alert: ServiceDown
            expr: up{job="kubernetes-pods"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Target {{ $labels.instance }} is down"
```

### Pitfall 8: misconfigured HPA, autoscaling never kicks in

**What happened**: traffic surged, but the Pod count stayed flat

**Symptoms**:

```bash
# Traffic is up 10x
# But there are still only 3 Pods
kubectl get pods

# NAME          READY  STATUS
# webapp-pod-1  1/1    Running
# webapp-pod-2  1/1    Running
# webapp-pod-3  1/1    Running

# Every Pod is overloaded; responses crawl
```

**Root cause**:
- No HorizontalPodAutoscaler (HPA) configured
- Or the configured metrics were wrong (utilization-based HPA also requires resource requests and a running metrics-server)

**Fix**:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp-deployment

  # Min/max replicas
  minReplicas: 3
  maxReplicas: 10

  # Target metrics
  metrics:
    # CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    # Memory
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

    # Custom metric: QPS (requires a custom-metrics adapter)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

  # Scaling behavior
  behavior:
    scaleUp:
      # Scale up fast
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100        # at most double each step
          periodSeconds: 15
        - type: Pods
          value: 4          # at most 4 extra Pods each step
          periodSeconds: 15
      selectPolicy: Max

    scaleDown:
      # Scale down slowly
      stabilizationWindowSeconds: 300   # 5-minute stabilization window
      policies:
        - type: Percent
          value: 10         # shed at most 10% each step
          periodSeconds: 60
```

**Result**:
- Scales out when average CPU exceeds 70%
- Scales out when average memory exceeds 80%
- Scales out when per-Pod QPS exceeds 1000
- Scales back in automatically once traffic subsides
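
Under the hood, the HPA evaluates `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)` per metric and takes the largest result. A quick arithmetic sketch with plausible numbers for this config (3 replicas, CPU target 70%, observed 90%):

```shell
# HPA formula: desired = ceil(current * observed / target)
current_replicas=3
observed_cpu=90   # average utilization across Pods, in %
target_cpu=70     # averageUtilization from the HPA spec

# integer ceiling division: (a + b - 1) / b
desired=$(( (current_replicas * observed_cpu + target_cpu - 1) / target_cpu ))
echo "scale to: $desired"   # → scale to: 4
```

So at 90% observed CPU, the HPA asks for 4 replicas; the `behavior` policies then cap how quickly it may get there.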

### Pitfall 9: no network policies, wide-open east-west traffic

**What happened**: any Pod could talk to any service

**Security problems**:
- Frontend Pods could reach the database directly
- The test environment could reach production
- An attacker who compromised one Pod could move laterally

**Fix: NetworkPolicy** (requires a CNI plugin that enforces policies, such as Calico or Cilium)

```yaml
# Deny all traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow only webapp -> backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: webapp-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend

  policyTypes:
    - Ingress

  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: webapp
      ports:
        - protocol: TCP
          port: 3000
---
# Allow only backend -> database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-to-db
spec:
  podSelector:
    matchLabels:
      app: mysql

  policyTypes:
    - Ingress

  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - protocol: TCP
          port: 3306
---
# Allow DNS lookups
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}

  policyTypes:
    - Egress

  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```

**Result**:
- The frontend can no longer reach the database directly
- Only explicitly allowed traffic flows
- Least-privilege networking

### Pitfall 10: node failures unhandled, service outage

**What happened**: a physical node died and took all its Pods with it

**Symptoms**:

```bash
# Node 1 is down
kubectl get nodes

# NAME   STATUS    ROLES
# node1  NotReady          # failed
# node2  Ready
# node3  Ready

# All Pods on node1 are gone
kubectl get pods -o wide

# NAME          READY  STATUS   NODE
# webapp-pod-1  0/1    Unknown  node1
# webapp-pod-2  1/1    Running  node2
# webapp-pod-3  1/1    Running  node3
```

**Fix: PodDisruptionBudget + anti-affinity**

**PodDisruptionBudget (PDB)**, which protects against *voluntary* disruptions such as node drains during maintenance:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2   # keep at least 2 Pods available

  selector:
    matchLabels:
      app: webapp
```

**Pod anti-affinity**, which spreads replicas across nodes so that one node failure cannot take them all:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3

  selector:
    matchLabels:
      app: webapp

  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        # Pod anti-affinity; in practice, pick ONE of the two styles below
        podAntiAffinity:
          # Soft anti-affinity (best effort)
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: webapp
                topologyKey: kubernetes.io/hostname

          # Hard anti-affinity (must be satisfied)
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: webapp
              topologyKey: kubernetes.io/hostname

      containers:
        - name: webapp
          image: myapp:1.0.0
```

**Result**:
- The 3 Pods land on 3 different nodes
- Any single node can fail and the service stays up
- High-availability requirements met
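
A related, often simpler tool (stable since Kubernetes 1.19) is `topologySpreadConstraints`, which bounds the Pod-count skew between nodes instead of forbidding co-location outright. A sketch using the same `app: webapp` labels:

```yaml
# Goes under the Pod template's spec, alongside (or instead of) affinity
topologySpreadConstraints:
  - maxSkew: 1                            # node counts may differ by at most 1 Pod
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule      # or ScheduleAnyway for a soft constraint
    labelSelector:
      matchLabels:
        app: webapp
```

Unlike hard anti-affinity, this keeps working when you have more replicas than nodes: Pods spread as evenly as the skew bound allows instead of going unschedulable.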

## Part 3: K8s Production Best Practices

### 1. Namespace isolation

```yaml
# Development
apiVersion: v1
kind: Namespace
metadata:
  name: dev
---
# Testing
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
# Production
apiVersion: v1
kind: Namespace
metadata:
  name: prod
---
# ResourceQuota (per-namespace resource budget)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: dev
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
```

### 2. Priority classes

```yaml
# High priority (production)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority class"
---
# Medium priority (testing)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 500
globalDefault: true
description: "Medium priority class (cluster default)"
---
# Low priority (development)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority class"
```

### 3. Rolling update strategy

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 10

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # up to 2 extra Pods (12 running at once)
      maxUnavailable: 0   # zero downtime

  revisionHistoryLimit: 10   # keep 10 historical revisions

  # … rest of the spec
```

### 4. Health checks

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp
      image: myapp:1.0.0

      # Liveness probe (health check)
      livenessProbe:
        httpGet:
          path: /health
          port: 3000
        initialDelaySeconds: 60
        periodSeconds: 10
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 3

      # Readiness probe
      readinessProbe:
        httpGet:
          path: /ready
          port: 3000
        initialDelaySeconds: 30
        periodSeconds: 5
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3

      # Startup probe (for slow starters; suppresses the other probes until it passes)
      startupProbe:
        httpGet:
          path: /startup
          port: 3000
        initialDelaySeconds: 0
        periodSeconds: 5
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 30   # up to 150s to start
```

### 5. Resource limits

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: default
spec:
  limits:
    - default:            # default limits for containers that set none
        cpu: 500m
        memory: 512Mi
      defaultRequest:     # default requests for containers that set none
        cpu: 250m
        memory: 256Mi
      type: Container
```

## Part 4: K8s Case Studies

### Case 1: zero-downtime deployment

**Goal**: roll out a new application version without users noticing

**Implementation**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 10

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0

  minReadySeconds: 5
  revisionHistoryLimit: 10

  selector:
    matchLabels:
      app: webapp

  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: myapp:2.0.0

          ports:
            - containerPort: 3000

          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"

          lifecycle:
            preStop:
              exec:
                command:
                  - sh
                  - -c
                  - "sleep 15"   # drain window; the app itself handles SIGTERM

          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 10

          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
```

**Rollout workflow**:

```bash
# 1. Update the image
kubectl set image deployment/webapp-deployment webapp=myapp:2.0.0

# 2. Watch the rollout
kubectl rollout status deployment/webapp-deployment

# 3. Roll back immediately if something is wrong
kubectl rollout undo deployment/webapp-deployment

# 4. Inspect the revision history
kubectl rollout history deployment/webapp-deployment

# 5. Roll back to a specific revision
kubectl rollout undo deployment/webapp-deployment --to-revision=3
```

### Case 2: autoscaling

**Goal**: scale out automatically during traffic peaks

**Implementation**:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp-deployment

  minReplicas: 3
  maxReplicas: 20

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 5
          periodSeconds: 15
      selectPolicy: Max

    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

**Testing it**:

```bash
# Load test (the `stress` image name is illustrative; any CPU-burner image works)
kubectl run -it --rm stress --image=stress --restart=Never -- \
  stress --cpu 4 --timeout 300s

# Watch the scale-out
kubectl get hpa
kubectl get pods -w
```

## Part 5: A K8s Learning Path

### Month 1: fundamentals

**Topics**:
- K8s core concepts
- Basic kubectl commands
- Pod, Deployment, Service
- Deploying a simple application

**Hands-on projects**:
- Deploy Nginx
- Deploy WordPress
- Configure a Service and an Ingress

### Months 2–3: going deeper

**Topics**:
- ConfigMap and Secret
- Persistent storage (PV/PVC)
- Monitoring and logging
- Rolling updates and rollbacks

**Hands-on projects**:
- Deploy a stateful application (MySQL)
- Set up CI/CD
- Implement autoscaling

### Months 4–6: advanced topics

**Topics**:
- Network policies
- Resource quotas
- Priority classes
- Custom resources (CRDs)
- Operator development

**Hands-on projects**:
- Multi-tenant cluster
- Microservice governance
- Service mesh (Istio)

## Closing Thoughts: K8s Changed How We Operate

### Two years in

Going from K8s beginner to seasoned K8s practitioner, over these two years I have:

**Skills**:
- ✅ Learned 50+ kubectl commands
- ✅ Managed 100+ Pods
- ✅ Handled 10+ kinds of production incidents
- ✅ Automated our operations

**Efficiency**:
- Deployment time: 2 hours → 2 minutes (-98%)
- Failure recovery: 1 hour → 1 minute (-98%)
- Resource utilization: 20% → 70% (+250%)
- Ops headcount: 5 → 2 (-60%)

**Mindset**:
- Before: afraid of automation, worried about losing control
- Now: embracing automation, enjoying the efficiency

## Join the Conversation

### Have you run into these problems too?

- What difficulties have you hit with K8s?
- What are your favorite K8s tricks?
- Which topics would you like covered in more depth?

**Share your stories and experience in the comments!**

**Related articles**:
- [Docker in Production](/docker-production-guide)
- [Microservice Architecture Design](/microservices-architecture)
- [DevOps Best Practices](/devops-best-practices)

**Recommended resources**:
- [Official Kubernetes documentation](https://kubernetes.io/docs/)
- [A hands-on K8s course](https://www.kubernetesacademy.com/)
- [Awesome K8s](https://github.com/ramitsurana/awesome-kubernetes)

**About the author**: Chen Cunli, full-stack developer, Certified Kubernetes Administrator (CKA), 2 years of K8s production experience, currently working on a typical engineering team. Loves sharing and runs a technical blog.

**Copyright**: this is original work; please credit the source when republishing.
