Feb 21, 2017

VM fails to boot after evacuate in a Ceph environment

Recently I have been testing some VM features in a Ceph environment, one of the more important ones being evacuation (evacuate). While testing evacuate I hit the following problem: after the VM was rebuilt on the new node, it reported a "No bootable device" error on startup and would not boot. Let's dig into this problem below.

Background

Evacuate VM vm01 from compute node 01 to compute node 02. [figure]

First, the steps that were performed:

  • Halt the compute node
  • Run the evacuation on the controller node
    nova evacuate 9cca2637-5263-4ffc-a620-976d2a59f838 openstack29.add.bjyt.qihoo.net --on-shared-storage
  • Check the VM's state via the VNC console on the dashboard [screenshot]

As you can see, the VM did not boot successfully and shows the "No bootable device" message.

Troubleshooting

Let's start with the logs: nothing abnormal shows up in them, and the whole workflow looks fine.

Check whether the connections to rbd exist

Since the storage backend is Ceph, my guess was that the rbd connections were broken after the evacuate, so the system disk could not be loaded. Another reason for this guess: once the "No bootable device" error appears, a soft or hard reboot brings the VM back up normally.

So let's check the connections; the Ceph monitors all listen on port 6789.

netstat -anlp | grep 6789
tcp        0      0 10.1.1.1:41468     10.208.139.102:6789     ESTABLISHED 31568/qemu-kvm     
tcp        0      0 10.1.1.1:45172     10.208.139.84:6789      TIME_WAIT   -                  
tcp        0      0 10.1.1.1:41463     10.208.139.102:6789     ESTABLISHED 31568/qemu-kvm

The connections are all established, and comparing with a healthy VM, they look exactly the same.
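
For reference, the same check can be scripted. Below is a minimal sketch using psutil (an assumption on my part; it was not used on the original host), which is handy when many qemu processes need to be checked at once:

import psutil

CEPH_MON_PORT = 6789

# List every qemu process's TCP connections to the Ceph monitors.
# Inspecting processes owned by the qemu user usually requires root.
for proc in psutil.process_iter(['pid', 'name']):
    try:
        if 'qemu' not in (proc.info['name'] or ''):
            continue
        for conn in proc.connections(kind='tcp'):
            if conn.raddr and conn.raddr.port == CEPH_MON_PORT:
                print(proc.info['pid'], conn.laddr, '->', conn.raddr, conn.status)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue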

Compare the VMs' qemu processes

First, the qemu process of the VM that failed to start after the evacuate:

qemu -m 4000 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -object memory-backend-ram,id=ram-node0,size=2097152000,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2097152000,host-nodes=1,policy=bind -numa node,nodeid=1,cpus=1,memdev=ram-node1 -uuid 61b1373f-5c0d-4bcc-b471-d5435e82cfda -smbios type=1,manufacturer=Fedora Project,product=OpenStack Nova,version=2015.1.1-10.el7.centos,serial=1a0d99ec-e580-49d3-a28c-9308e97dfc20 ....

Now the healthy case after a reboot:

qemu     22973     1  2 11:47 ?        00:03:36 /usr/libexec/qemu-kvm -name instance-00001096 -S -machine pc-i440fx-2.6,accel=kvm,usb=off -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 9cca2637-5263-4ffc-a620-976d2a59f838 -smbios type=1,manufacturer=Fedora Project,product=OpenStack Nova,version=2015.1.1-10.el7.centos,serial=1a0d99ec-e580-49d3-a28c-9308e97dfc20,uuid=9cca2637-5263-4ffc-a620-976d2a59f838 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-instance-00001096/monitor.sock,server,nowait -mon chardev=charmonitor .....

After some simple formatting and diffing (a rough helper for this is sketched after the list below), the following differences show up.

qemu process when the VM will not boot:
   -device lsi,id=scsi0,bus=pci.0,addr=0x4
   -device scsi-hd,bus=scsi0.0,scsi-id=0,drive=drive-scsi0-0-0,id=scsi0-0-0,bootindex=1
   -device scsi-hd,bus=scsi0.0,scsi-id=1,drive=drive-scsi0-0-1,id=scsi0-0-1

qemu process after the reboot, when the VM works:
   These lines describe the disk devices:
   -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
   -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
   -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
   -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 
   The next two lines are about qemu-guest-agent:
   -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/org.qemu.guest_agent.0.instance-00001039.sock,server,nowait
   -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0

From the above we can see two differences:

  • the disk devices (driver)
  • qemu-guest-agent
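
Here is the kind of throwaway helper I mean by "simple formatting": it splits each command line into options, keeps only the -device/-chardev ones, and diffs them. The two input files are placeholders for wherever you save the two command lines (for example, copied from the ps output above):

import difflib
import shlex

def device_args(cmdline):
    """Return every '-device ...' / '-chardev ...' option as one line."""
    tokens = shlex.split(cmdline)
    opts = []
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ('-device', '-chardev'):
            opts.append('%s %s' % (flag, value))
    return sorted(opts)

broken = device_args(open('/tmp/broken_cmdline.txt').read())
healthy = device_args(open('/tmp/healthy_cmdline.txt').read())
print('\n'.join(difflib.unified_diff(broken, healthy,
                                     'broken', 'healthy', lineterm='')))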

Compare the VMs' libvirt XML files

A diff of the two libvirt.xml files, with the broken instance's on top and the healthy (post-reboot) one on the bottom: [screenshot]

From it we can also see two discrepancies:

The disk driver part:
    <controller type="scsi" model="virtio-scsi"/>
The qemu-guest-agent part:
    <     <channel type="unix">
    <       <source mode="bind" path="/var/lib/libvirt/qemu/org.qemu.guest_agent.0.instance-00001039.sock"/>
    <       <target type="virtio" name="org.qemu.guest_agent.0"/>
    <     </channel>

These two differences match exactly what we saw in the qemu processes.

At this point we can see that both of these are configured through image metadata. Let's look at the metadata of the image we used.

+------------------------------+--------------------------------------+
| Property                     | Value                                |
+------------------------------+--------------------------------------+
| OS-EXT-IMG-SIZE:size         | 53687091200                          |
| id                           | 8b218b4d-74ff-44af-bc4c-c37fb1106b03 |
| metadata hw_disk_bus         | scsi                                 |
| metadata hw_qemu_guest_agent | yes                                  |
| metadata hw_scsi_model       | virtio-scsi                          |
| minDisk                      | 0                                    |
| minRam                       | 0                                    |
| name                         | test-20160920                        |
| progress                     | 100                                  |
| status                       | ACTIVE                               |
+------------------------------+--------------------------------------+

And the metadata we added ourselves covers exactly those two areas (what a coincidence!):

  • hw_disk_bus = scsi
  • hw_scsi_model = virtio-scsi
  • hw_qemu_guest_agent = yes

So let's do a test: remove all of these metadata properties from the image, create a VM, and run evacuate again (one way to strip the properties is sketched below).
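
As a sketch of how the properties could be stripped for this test, assuming python-glanceclient against the Glance v2 API; the endpoint and token are placeholders, and the image id is the one shown in the listing above (the CLI's --remove-property option should do the same):

from glanceclient import Client

# Placeholders: point these at your own Glance endpoint and a valid token.
glance = Client('2', endpoint='http://controller:9292', token='<keystone-token>')
glance.images.update('8b218b4d-74ff-44af-bc4c-c37fb1106b03',
                     remove_props=['hw_disk_bus',
                                   'hw_scsi_model',
                                   'hw_qemu_guest_agent'])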

The result, as you can probably guess: success!

Going back to the logs, you can see that when the evacuate runs, image_meta is indeed empty: image_meta: {}
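
To make the effect of that empty dict concrete, here is a purely illustrative sketch (not Nova's actual code) of how the SCSI controller model requested for the guest depends on image_meta; with {} you fall back to the default lsi controller, which matches the broken qemu command line above:

# Illustration only: the real selection logic lives in nova's libvirt driver.
QEMU_DEFAULT_SCSI_MODEL = 'lsi'   # what the broken domain ended up with

def pick_scsi_model(image_meta):
    props = image_meta.get('properties', {}) if image_meta else {}
    return props.get('hw_scsi_model', QEMU_DEFAULT_SCSI_MODEL)

# Normal boot / reboot: metadata is fetched from Glance.
print(pick_scsi_model({'properties': {'hw_disk_bus': 'scsi',
                                      'hw_scsi_model': 'virtio-scsi'}}))   # virtio-scsi
# Evacuate on shared storage hits the buggy branch: image_meta == {}
print(pick_scsi_model({}))                                                 # lsi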

Solution

After digging through the nova code, this turns out to be a bug.

Bug report: https://bugs.launchpad.net/nova/+bug/1635160

In one sentence: for shared storage, nova's logic is to reuse the existing image disk, so it never goes back and applies the properties from the image metadata, and everything we added to the image metadata gets "lost".

The relevant code is as follows: nova/compute/api.py

@check_instance_state(vm_state=[vm_states.ACTIVE, vm_states.STOPPED,
                                vm_states.ERROR])
def evacuate(self, context, instance, host, on_shared_storage,
             admin_password=None):
    """
       docstring omitted ...
    """
    LOG.debug('vm evacuation scheduled', instance=instance)
    inst_host = instance.host
    service = objects.Service.get_by_compute_host(context, inst_host)
    if self.servicegroup_api.service_is_up(service):
        LOG.error(_LE('Instance compute service state on %s '
                      'expected to be down, but it was up.'), inst_host)
        raise exception.ComputeServiceInUse(host=inst_host)

    instance.task_state = task_states.REBUILDING
    instance.save(expected_task_state=[None])
    self._record_action_start(context, instance, instance_actions.EVACUATE)

    return self.compute_task_api.rebuild_instance(context,
                   instance=instance,
                   new_pass=admin_password,
                   injected_files=None,
                   image_ref=None,
                   orig_image_ref=None,
                   orig_sys_metadata=None,
                   bdms=None,
                   recreate=True,
                   on_shared_storage=on_shared_storage,
                   host=host)

nova/compute/manager.py

@object_compat
@messaging.expected_exceptions(exception.PreserveEphemeralNotSupported)
@wrap_exception()
@reverts_task_state
@wrap_instance_event
@wrap_instance_fault
def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
                     injected_files, new_pass, orig_sys_metadata,
                     bdms, recreate, on_shared_storage,
                     preserve_ephemeral=False):
    """
        docstring omitted ...
    """
    context = context.elevated()
    # NOTE (ndipanov): If we get non-object BDMs, just get them from the
    # db again, as this means they are sent in the old format and we want
    # to avoid converting them back when we can just get them.
    # Remove this on the next major RPC version bump
    if (bdms and
        any(not isinstance(bdm, obj_base.NovaObject)
            for bdm in bdms)):
        bdms = None

    orig_vm_state = instance.vm_state
    with self._error_out_instance_on_exception(context, instance):
        LOG.info(_LI("Rebuilding instance"), context=context,
                  instance=instance)

        if recreate:
            if not self.driver.capabilities["supports_recreate"]:
                raise exception.InstanceRecreateNotSupported

            self._check_instance_exists(context, instance)

            # To cover case when admin expects that instance files are on
            # shared storage, but not accessible and vice versa
            if on_shared_storage != self.driver.instance_on_disk(instance):
                raise exception.InvalidSharedStorage(
                        _("Invalid state of instance files on shared"
                          " storage"))

            # The problem is here.
            # The logic: when on_shared_storage (shared storage), image_meta is simply not set.
            if on_shared_storage:
                LOG.info(_LI('disk on shared storage, recreating using'
                             ' existing disk'))
            else:
                image_ref = orig_image_ref = instance.image_ref
                LOG.info(_LI("disk not on shared storage, rebuilding from:"
                             " '%s'"), str(image_ref))

            # NOTE(mriedem): On a recreate (evacuate), we need to update
            # the instance's host and node properties to reflect it's
            # destination node for the recreate.
            node_name = None
            try:
                compute_node = self._get_compute_info(context, self.host)
                node_name = compute_node.hypervisor_hostname
            except exception.ComputeHostNotFound:
                LOG.exception(_LE('Failed to get compute_info for %s'),
                              self.host)
            finally:
                instance.host = self.host
                instance.node = node_name
                instance.save()

        if image_ref:
            image_meta = self.image_api.get(context, image_ref)
        else:
            image_meta = {}

        # remaining code omitted ...

The modified code is as follows:

@object_compat
@messaging.expected_exceptions(exception.PreserveEphemeralNotSupported)
@wrap_exception()
@reverts_task_state
@wrap_instance_event
@wrap_instance_fault
def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
                     injected_files, new_pass, orig_sys_metadata,
                     bdms, recreate, on_shared_storage,
                     preserve_ephemeral=False):
    """
        docstring omitted ...
    """
    context = context.elevated()
    # NOTE (ndipanov): If we get non-object BDMs, just get them from the
    # db again, as this means they are sent in the old format and we want
    # to avoid converting them back when we can just get them.
    # Remove this on the next major RPC version bump
    if (bdms and
        any(not isinstance(bdm, obj_base.NovaObject)
            for bdm in bdms)):
        bdms = None

    orig_vm_state = instance.vm_state
    with self._error_out_instance_on_exception(context, instance):
        LOG.info(_LI("Rebuilding instance"), context=context,
                  instance=instance)

        if recreate:
            if not self.driver.capabilities["supports_recreate"]:
                raise exception.InstanceRecreateNotSupported

            self._check_instance_exists(context, instance)

            # To cover case when admin expects that instance files are on
            # shared storage, but not accessible and vice versa
            if on_shared_storage != self.driver.instance_on_disk(instance):
                raise exception.InvalidSharedStorage(
                        _("Invalid state of instance files on shared"
                          " storage"))

            if on_shared_storage:
                LOG.info(_LI('disk on shared storage, recreating using'
                             ' existing disk'))
            else:
                image_ref = orig_image_ref = instance.image_ref
                LOG.info(_LI("disk not on shared storage, rebuilding from:"
                             " '%s'"), str(image_ref))

            # NOTE(mriedem): On a recreate (evacuate), we need to update
            # the instance's host and node properties to reflect it's
            # destination node for the recreate.
            node_name = None
            try:
                compute_node = self._get_compute_info(context, self.host)
                node_name = compute_node.hypervisor_hostname
            except exception.ComputeHostNotFound:
                LOG.exception(_LE('Failed to get compute_info for %s'),
                              self.host)
            finally:
                instance.host = self.host
                instance.node = node_name
                instance.save()

            # This is the change
            if image_ref:
                image_meta = self.image_api.get(context, image_ref)
            else:
                image_meta = instance.image_meta

        # remaining code omitted ...

Link to this post: https://www.opsdev.cn/post/ceph-evacuate-vm-noboot.html

-- EOF --
