February 21, 2017

OpenStack Mitaka Scheduler Analysis

We recently upgraded our OpenStack deployment to the Mitaka release. Compared with Kilo, Mitaka makes some changes to the scheduling logic, so here is a brief analysis.

This is not a complete end-to-end analysis of the scheduling algorithm; I don't think that is necessary, since a quick search on Google or Baidu turns up plenty of articles on OpenStack Nova scheduling (the official documentation is still the best reference, of course). So what am I going to cover? One specific change between the two releases: support for setting the overcommit ratio on individual compute nodes.

In Kilo the ratio was configured globally on the controller node, although if you use aggregates, each aggregate can carry its own overcommit setting. Being able to set the ratio per compute node makes resource scheduling more flexible and gives operators a few more options for shaping resource usage.

Three overcommit ratio settings in Mitaka

  • (a) The cpu_allocation_ratio metadata key consumed by AggregateCoreFilter
  • (b) The cpu_allocation_ratio option in a compute node's own nova.conf [new in Mitaka]; see the example after this list
  • (c) The cpu_allocation_ratio option in the controller node's nova.conf
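
For instance, a single compute node in Mitaka can now carry its own ratio in its local nova.conf (a minimal sketch; the value is purely illustrative):

# /etc/nova/nova.conf on one compute node (illustrative value)
[DEFAULT]
cpu_allocation_ratio = 4.0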

Core Nova scheduling algorithm

  1. Fetch all compute hosts from the DB;
  2. Apply the filters listed in CONF's scheduler_default_filters one by one, in that order; each filter works on the compute hosts that survived the previous filter (don't let that trip you up);
  3. Weigh the compute hosts that passed all filters (weights);
  4. Pick the usable compute hosts (a simplified sketch follows this list).
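
A minimal sketch of steps 2-4 (simplified; this is not the actual FilterScheduler code, and the filter/weigher objects are hypothetical stand-ins):

# Filters prune the host list in the configured order, then the survivors
# are ranked by their combined weight (best candidates first).
def select_hosts(hosts, spec_obj, enabled_filters, weighers):
    for flt in enabled_filters:        # order of CONF.scheduler_default_filters
        hosts = [h for h in hosts if flt.host_passes(h, spec_obj)]
        if not hosts:
            return []                  # nothing left to schedule on
    def total_weight(host):
        return sum(w.weigh(host, spec_obj) for w in weighers)
    return sorted(hosts, key=total_weight, reverse=True)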

Questions

QA1: Does the order of the filters (CoreFilter and AggregateCoreFilter) affect the filtering result?

Conclusion: yes, a different filter order can lead to a different scheduling result.
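
Both filters share the host_passes() shown later: each recomputes the host's total vcpus with its own ratio and records it in host_state.limits['vcpu'], so when both are enabled, whichever runs later overwrites that limit, which is one way their position in scheduler_default_filters changes the outcome. The order is set on the scheduler node, for example (filter list shortened and purely illustrative):

[DEFAULT]
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,AggregateCoreFilter,CoreFilter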

QA2: Now that the overcommit ratio can be set on an individual compute node, what is the priority among these three settings?

Conclusion: the priority is a > b > c. That is, when all three values are present, filtering uses the cpu_allocation_ratio from the aggregate metadata first, then the value set on the compute node itself, and only then the global value on the controller node.
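
As a concrete illustration (all numbers are made up; the reasoning behind the priority is worked out in the code analysis below):

# Illustrative scenario only
aggregate_metadata_ratio = 2.0   # (a) cpu_allocation_ratio in the host's aggregate metadata
compute_node_ratio = 4.0         # (b) cpu_allocation_ratio in the node's own nova.conf
controller_ratio = 16.0          # (c) cpu_allocation_ratio in the controller's nova.conf

# AggregateCoreFilter prefers (a); (b) only applies when the host is not in an
# aggregate carrying that key, and (c) only applies when the node reports no
# value of its own.
effective_ratio = aggregate_metadata_ratio   # -> 2.0 for this host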

QA3: How does filtering work for a request whose flavor has extra specs versus one that has none? Can a flavor with extra specs be scheduled onto a compute node that has not joined any aggregate? Can a flavor without extra specs be scheduled onto a compute node that has joined an aggregate?

Conclusion: it is a bit more involved; see the analysis below.

Next, let's take CoreFilter, AggregateCoreFilter and AggregateInstanceExtraSpecsFilter as examples and look at the implementation details. The scheduler-related filters and weights:

├── filters
│   ├── affinity_filter.py
│   ├── aggregate_image_properties_isolation.py
│   ├── aggregate_instance_extra_specs.py
│   ├── aggregate_multitenancy_isolation.py
│   ├── all_hosts_filter.py
│   ├── availability_zone_filter.py
│   ├── compute_capabilities_filter.py
│   ├── compute_filter.py
│   ├── core_filter.py
│   ├── disk_filter.py
│   ├── exact_core_filter.py
│   ├── exact_disk_filter.py
│   ├── exact_ram_filter.py
│   ├── extra_specs_ops.py
│   ├── image_props_filter.py
│   ├── __init__.py
│   ├── io_ops_filter.py
│   ├── isolated_hosts_filter.py
│   ├── json_filter.py
│   ├── metrics_filter.py
│   ├── numa_topology_filter.py
│   ├── num_instances_filter.py
│   ├── pci_passthrough_filter.py
│   ├── ram_filter.py
│   ├── retry_filter.py
│   ├── trusted_filter.py
│   ├── type_filter.py
│   ├── utils.py
├── (omitted)
└── weights
    ├── affinity.py
    ├── disk.py
    ├── __init__.py
    ├── io_ops.py
    ├── metrics.py
    └── ram.py

Let's look at the implementations of CoreFilter and AggregateCoreFilter; both live in filters/core_filter.py shown above.

# The base class
class BaseCoreFilter(filters.BaseHostFilter):
    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        raise NotImplementedError
    def host_passes(self, host_state, spec_obj):
        """Return True if host has sufficient CPU cores."""
        # If the compute node reports no vcpus_total, state collection is broken
        if not host_state.vcpus_total:
            # Fail safe
            LOG.warning(_LW("VCPUs not set; assuming CPU collection broken"))
            return True
        # Get the vcpu count requested by the flavor from spec_obj
        # (spec_obj bundles the metadata of the request: flavor, image, etc.)
        instance_vcpus = spec_obj.vcpus
        cpu_allocation_ratio = self._get_cpu_allocation_ratio(host_state,
                                                              spec_obj)
        # Total vcpus available on this host after applying the overcommit ratio
        vcpus_total = host_state.vcpus_total * cpu_allocation_ratio
        # Only provide a VCPU limit to compute if the virt driver is reporting
        # an accurate count of installed VCPUs. (XenServer driver does not)
        if vcpus_total > 0:
            host_state.limits['vcpu'] = vcpus_total
            # Do not allow an instance to overcommit against itself, only
            # against other instances.
            # Reject if the instance needs more vcpus than the host physically has
            if instance_vcpus > host_state.vcpus_total:
                LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
                      "total cpus before overcommit, it only has %(cpus)d",
                      {'host_state': host_state,
                       'instance_vcpus': instance_vcpus,
                       'cpus': host_state.vcpus_total})
                return False
        # Vcpus still free after subtracting what is already allocated
        free_vcpus = vcpus_total - host_state.vcpus_used
        # Not enough free vcpus left for this instance
        if free_vcpus < instance_vcpus:
            LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
                      "usable vcpus, it only has %(free_vcpus)d usable "
                      "vcpus",
                      {'host_state': host_state,
                       'instance_vcpus': instance_vcpus,
                       'free_vcpus': free_vcpus})
            return False
        return True

# The CoreFilter subclass
class CoreFilter(BaseCoreFilter):
    """CoreFilter filters based on CPU core utilization."""
    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        return host_state.cpu_allocation_ratio

Very simple: it only overrides how the host's CPU overcommit ratio is obtained; the filtering itself reuses the base class logic.

# The AggregateCoreFilter subclass
class AggregateCoreFilter(BaseCoreFilter):
    """AggregateCoreFilter with per-aggregate CPU subscription flag.
    Fall back to global cpu_allocation_ratio if no per-aggregate setting found.
    """
    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        # Fetch the cpu_allocation_ratio values from the host's aggregate metadata
        aggregate_vals = utils.aggregate_values_from_key(
            host_state,
            'cpu_allocation_ratio')
        try:
            # This is where the per-node setting and the aggregate setting compete.
            # If no aggregate metadata is set (the compute node has not joined
            # any aggregate), the value from the DB is used directly;
            # if it is set, the minimum of the aggregate metadata values is used.
            # So aggregate metadata has the higher priority here.
            # Where does the controller value fit in? That depends on how the
            # periodically reported compute node value is derived; see
            # _from_db_object in nova/objects/compute_node.py.
            # Short answer: aggregate metadata > per-compute-node setting > controller value.
            ratio = utils.validate_num_values(
                aggregate_vals, host_state.cpu_allocation_ratio, cast_to=float)
        except ValueError as e:
            LOG.warning(_LW("Could not decode cpu_allocation_ratio: '%s'"), e)
            ratio = host_state.cpu_allocation_ratio
        return ratio

This subclass also only overrides how the host's CPU overcommit ratio is obtained, but it is a bit more involved, because it has to account for the newly added feature: a cpu_allocation_ratio set on the compute node itself.
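
A quick illustration of the fallback described in the comments above (illustrative values; this mirrors the comments rather than calling the real utils.validate_num_values()):

def pick_ratio(aggregate_vals, node_reported_ratio):
    if not aggregate_vals:                        # host is not in any aggregate
        return node_reported_ratio                # use the value reported to the DB
    return min(float(v) for v in aggregate_vals)  # smallest aggregate value wins

assert pick_ratio(set(), 4.0) == 4.0
assert pick_ratio({'2.0', '3.0'}, 4.0) == 2.0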

The _from_db_object method in nova/objects/compute_node.py: compute nodes periodically sync their state to the DB, and the three values cpu_allocation_ratio, ram_allocation_ratio and disk_allocation_ratio are derived as follows:

def _from_db_object(context, compute, db_compute):
    ......
    fields = set(compute.fields) - special_cases
    for key in fields:
        value = db_compute[key]
        if (key == 'cpu_allocation_ratio' or key == 'ram_allocation_ratio'
            or key == 'disk_allocation_ratio'):
            if value == 0.0:
                # Operator has not yet provided a new value for that ratio
                # on the compute node
                value = None
            if value is None:
                # ResourceTracker is not updating the value (old node)
                # or the compute node is updated but the default value has
                # not been changed
                value = getattr(CONF, key)
                if value == 0.0 and key == 'cpu_allocation_ratio':
                    # It's not specified either on the controller
                    value = 16.0
                if value == 0.0 and key == 'ram_allocation_ratio':
                    # It's not specified either on the controller
                    value = 1.5
                if value == 0.0 and key == 'disk_allocation_ratio':
                    # It's not specified either on the controller
                    value = 1.0
        compute[key] = value
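
Condensing the fallback above for cpu_allocation_ratio only (an illustrative restatement, not the real Nova code): a 0.0 or missing DB value falls back to the CONF value, and a 0.0 CONF value falls back to the hard-coded 16.0.

def effective_cpu_ratio(db_value, conf_value):
    value = None if db_value in (0.0, None) else db_value
    if value is None:
        value = conf_value              # cpu_allocation_ratio from CONF
        if value == 0.0:
            value = 16.0                # hard-coded default when nothing is set
    return value

assert effective_cpu_ratio(0.0, 0.0) == 16.0   # nothing configured anywhere
assert effective_cpu_ratio(0.0, 8.0) == 8.0    # only CONF is configured
assert effective_cpu_ratio(4.0, 8.0) == 4.0    # the value reported by the node wins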

The logic of AggregateInstanceExtraSpecsFilter is as follows:

class AggregateInstanceExtraSpecsFilter(filters.BaseHostFilter):
    """AggregateInstanceExtraSpecsFilter works with InstanceType records."""
    # Aggregate data and instance type does not change within a request
    run_filter_once_per_request = True
    def host_passes(self, host_state, spec_obj):
        """Return a list of hosts that can create instance_type
        Check that the extra specs associated with the instance type match
        the metadata provided by aggregates.  If not present return False.
        """
        # Get the flavor from the request spec
        instance_type = spec_obj.flavor
        # If 'extra_specs' is not present or extra_specs are empty then we
        # need not proceed further
        # If the flavor has no extra specs at all, the host passes this filter
        if (not instance_type.obj_attr_is_set('extra_specs')
                or not instance_type.extra_specs):
            return True 
        # Get the metadata of the aggregates this host belongs to
        metadata = utils.aggregate_metadata_get_by_host(host_state)

        for key, req in six.iteritems(instance_type.extra_specs):
            # Either not scope format, or aggregate_instance_extra_specs scope
            scope = key.split(':', 1)
            if len(scope) > 1:
                if scope[0] != _SCOPE:
                    continue
                else:
                    del scope[0]
            key = scope[0]
            aggregate_vals = metadata.get(key, None)
            # If a key is required by the flavor's extra specs but missing from
            # the host's aggregate metadata, the host fails
            if not aggregate_vals:
                LOG.debug("%(host_state)s fails instance_type extra_specs "
                    "requirements. Extra_spec %(key)s is not in aggregate.",
                    {'host_state': host_state, 'key': key})
                return False
            # If one of the aggregate values matches the requirement, this key passes
            for aggregate_val in aggregate_vals:
                if extra_specs_ops.match(aggregate_val, req):
                    break
            else:
                LOG.debug("%(host_state)s fails instance_type extra_specs "
                            "requirements. '%(aggregate_vals)s' do not "
                            "match '%(req)s'",
                          {'host_state': host_state, 'req': req,
                           'aggregate_vals': aggregate_vals})
                return False
        return True
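
Putting QA3 together from the code above: a flavor without extra specs always passes this filter, so it can land on hosts both inside and outside aggregates; a flavor whose extra specs use the aggregate_instance_extra_specs scope (or no scope at all) only passes on hosts whose aggregate metadata contains a matching value, which means a host that is not in any aggregate (empty metadata) is rejected. A minimal sketch of that matching (illustrative; plain equality is used here, while the real code delegates to extra_specs_ops.match(), which also supports operators):

def passes(extra_specs, aggregate_metadata):
    if not extra_specs:
        return True                      # no extra specs -> host passes
    for key, req in extra_specs.items():
        scope = key.split(':', 1)
        if len(scope) > 1:
            if scope[0] != 'aggregate_instance_extra_specs':
                continue                 # keys with another scope are ignored
            key = scope[1]
        else:
            key = scope[0]
        vals = aggregate_metadata.get(key)
        if not vals:
            return False                 # required key not in any aggregate
        if req not in vals:
            return False                 # no aggregate value matches
    return True

# A flavor without extra specs can go anywhere, aggregate or not:
assert passes({}, {}) is True
# A flavor requiring ssd=true fails on a host outside any aggregate...
assert passes({'aggregate_instance_extra_specs:ssd': 'true'}, {}) is False
# ...and passes on a host whose aggregate metadata matches:
assert passes({'aggregate_instance_extra_specs:ssd': 'true'}, {'ssd': {'true'}}) is True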

So, after walking through the analysis above, do you have a better understanding of how overcommit affects scheduling in Mitaka?

Permalink: https://www.opsdev.cn/post/openstack-mitaka-scheduler.html

-- EOF --
