[Bug]: [benchmark][standalone] TimeTick Lag is very high, causing DQL request timeout #36195

Open
1 task done
wangting0128 opened this issue Sep 11, 2024 · 19 comments
Labels
kind/bug: Issues or changes related to a bug
stale: Indicates no updates for 30 days
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

wangting0128 (Contributor) commented Sep 11, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20240910-f4d0c589-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):rocksmq    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task:fouramf-bitmap-scenes-fdgrx
test case name:test_bitmap_locust_dql_dml_standalone

server:

NAME                                                              READY   STATUS             RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-bitmap-scenes-fdgrx-5-etcd-0                              1/1     Running            0                 3h12m   10.104.18.154   4am-node25   <none>           <none>
fouramf-bitmap-scenes-fdgrx-5-milvus-standalone-78f779649fl5ffr   1/1     Running            3 (3h10m ago)     3h12m   10.104.16.101   4am-node21   <none>           <none>
fouramf-bitmap-scenes-fdgrx-5-minio-6dcc448b8c-vnljg              1/1     Running            0                 3h12m   10.104.18.153   4am-node25   <none>           <none> 
[Screenshots: 2024-09-11 19:35:16, 19:35:35, 19:36:22, 19:39:02]

client test result:

[2024-09-10 07:09:41,546 - ERROR - fouram]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-09-10 07:08:41.544852', 'gRPC error': '2024-09-10 07:09:41.546277'}> (decorators.py:157)
[2024-09-10 07:09:41,547 - ERROR - fouram]: (api_response) : [Collection.search] <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-09-10T07:09:41.545818402+00:00"}"
>, [requestId: 828b5b70-6f43-11ef-99b2-72ddfb74a677] (api_request.py:57)
[2024-09-10 07:09:41,547 - ERROR - fouram]: [CheckFunc] search request check failed, response:<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-09-10T07:09:41.545818402+00:00"}"
> (func_check.py:106)
[2024-09-10 07:09:41,548 - ERROR - fouram]: [ClientTask] 
Traceback (most recent call last):
  File "/src/fouram/client/concurrent/locust_client.py", line 28, in wrapper
    result = func(*args, **kwargs)
  File "/src/fouram/client/cases/base.py", line 874, in concurrent_search
    return self.collection_wrap.search(data=_data, **params.obj_params)
  File "/src/fouram/client/client_base/collection_wrapper.py", line 144, in search
    check_result = ResponseChecker(res, func_name, check_task, check_items, res_result, data=data,
  File "/src/fouram/client/check/func_check.py", line 85, in run
    result = self.check_search_output(self.response, self.succ, self.check_items)
  File "/src/fouram/client/check/func_check.py", line 274, in check_search_output
    self.assert_success(actual_res_check, True)
  File "/src/fouram/client/check/func_check.py", line 107, in assert_success
    assert actual is expect
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/user/task.py", line 347, in run
    self.execute_next_task()
  File "/usr/local/lib/python3.8/dist-packages/locust/user/task.py", line 372, in execute_next_task
    self.execute_task(self._task_queue.pop(0))
  File "/usr/local/lib/python3.8/dist-packages/locust/user/task.py", line 493, in execute_task
    task(self.user)
  File "/src/fouram/client/concurrent/locust_client.py", line 46, in search
    self.client.search(self.tasks_params.search.params)
  File "/src/fouram/client/concurrent/locust_client.py", line 36, in wrapper
    raise Exception(f"[ClientTask] {e}")
Exception: [ClientTask] 
 (task.py:366)
[2024-09-10 07:09:44,983 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-09-10 07:09:44,984 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     delete                                                                           541     0(0.00%) |   8094       1  107954      7 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     flush                                                                            573     0(0.00%) | 116451     506  668524  57000 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     hybrid_search                                                                    547  529(96.71%) |    236       0   92795      0 |    0.05        0.05 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     insert                                                                           516     0(0.00%) |   8734       4   71060     26 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     load                                                                             562     0(0.00%) |  15715       3  119987     40 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     query                                                                            542  496(91.51%) |    389       0   78990      0 |    0.05        0.05 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     search                                                                           549  495(90.16%) |    481       0   51179      0 |    0.05        0.05 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]:          Aggregated                                                                      3830 1520(39.69%) |  22206       0  668524      6 |    0.35        0.14 (stats.py:789)
[2024-09-10 07:09:44,985 -  INFO - fouram]:  (stats.py:790)
[2024-09-10 07:09:44,989 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_16c64m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '16.0', 'memory': '64Gi'}, 'requests': {'cpu': '9.0', 'memory': '33Gi'}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1, 'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone', 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20240910-f4d0c589-amd64'}}},
            'host': 'fouramf-bitmap-scenes-fdgrx-5-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_bitmap_locust_dql_dml_standalone',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'max_length': 100,
                                                    'scalars_index': {'int8_1': {'index_type': 'BITMAP'},
                                                                      'int16_1': {'index_type': 'BITMAP'},
                                                                      'int32_1': {'index_type': 'BITMAP'},
                                                                      'int64_1': {'index_type': 'BITMAP'},
                                                                      'varchar_1': {'index_type': 'BITMAP'},
                                                                      'bool_1': {'index_type': 'BITMAP'},
                                                                      'array_int8_1': {'index_type': 'BITMAP'},
                                                                      'array_int16_1': {'index_type': 'BITMAP'},
                                                                      'array_int32_1': {'index_type': 'BITMAP'},
                                                                      'array_int64_1': {'index_type': 'BITMAP'},
                                                                      'array_varchar_1': {'index_type': 'BITMAP'},
                                                                      'array_bool_1': {'index_type': 'BITMAP'}},
                                                    'vectors_index': {'sparse_float_vector': {'index_type': 'SPARSE_INVERTED_INDEX',
                                                                                              'index_param': {'drop_ratio_build': 0.2},
                                                                                              'metric_type': 'IP'}},
                                                    'scalars_params': {'array_int8_1': {'params': {'max_capacity': 13},
                                                                                        'other_params': {'dataset': 'random_algorithm',
                                                                                                         'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                              'specify_range': [-128, 128],
                                                                                                                              'max_capacity': 13}}},
                                                                       'array_int16_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                               'specify_range': [-200, 200],
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_int32_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'specify_scope',
                                                                                                                               'specify_range': [-300, 300],
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_int64_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'fixed_value_range',
                                                                                                                               'specify_range': [-400, 432],
                                                                                                                               'batch': 50,
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_varchar_1': {'params': {'max_capacity': 13},
                                                                                           'other_params': {'dataset': 'random_algorithm',
                                                                                                            'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                                 'specify_range': [-1500, 1500],
                                                                                                                                 'max_capacity': 13}}},
                                                                       'array_bool_1': {'params': {'max_capacity': 13}},
                                                                       'int8_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                   'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                        'specify_range': [-128, 128],
                                                                                                                        'max_capacity': 13}}},
                                                                       'int16_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                         'specify_range': [-200, 200],
                                                                                                                         'max_capacity': 13}}},
                                                                       'int32_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'specify_scope',
                                                                                                                         'specify_range': [-300, 300],
                                                                                                                         'max_capacity': 13}}},
                                                                       'int64_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'fixed_value_range',
                                                                                                                         'specify_range': [-400, 432],
                                                                                                                         'batch': 50,
                                                                                                                         'max_capacity': 13}}},
                                                                       'varchar_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                      'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                           'specify_range': [-1500, 1500],
                                                                                                                           'max_capacity': 13}}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 2000000,
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1', 'bool_1',
                                                                        'array_int8_1', 'array_int16_1', 'array_int32_1', 'array_int64_1', 'array_varchar_1',
                                                                        'array_bool_1'],
                                                       'shards_num': 1,
                                                       'auto_id': True},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'search_param': {'nprobe': 16},
                                                                  'expr': 'int8_1 == 100',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'output_fields': ['id', 'float_vector', 'int64_1'],
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 60,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'nq': 10}}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': 10,
                                                                  'ignore_growing': False,
                                                                  'partition_names': None,
                                                                  'timeout': 60,
                                                                  'consistency_level': None,
                                                                  'random_data': False,
                                                                  'random_count': 0,
                                                                  'random_range': [0, 1],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_query_output',
                                                                  'check_items': {'expect_length': 10}}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 1,
                                                                  'reqs': [{'search_param': {'nprobe': 128},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': '(array_contains_any(array_int32_1, [0]) || array_contains(array_int64_1, '
                                                                                    '1)) || ((varchar_1 like "1%") and (bool_1 == True))',
                                                                            'top_k': 100},
                                                                           {'search_param': {'drop_ratio_search': 0.1},
                                                                            'anns_field': 'sparse_float_vector',
                                                                            'expr': 'not (int16_1 == int8_1) && ARRAY_CONTAINS_ANY(array_int64_1, [-1, 0, '
                                                                                    '1])'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'timeout': 120,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'output_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1',
                                                                                                    'int64_1', 'varchar_1', 'bool_1', 'array_int8_1',
                                                                                                    'array_int16_1', 'array_int32_1', 'array_int64_1',
                                                                                                    'array_varchar_1', 'array_bool_1', 'id', 'float_vector'],
                                                                                  'nq': 10}}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1, 'timeout': 180, 'check_task': 'check_response', 'check_items': None}},
                                                      {'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 10,
                                                                  'timeout': 30,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False,
                                                                  'start_id': 2000000,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'delete',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'delete_length': 10,
                                                                  'timeout': 30,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 600,
                                                                  'check_task': 'check_ignore_expected_errors',
                                                                  'check_items': [{'message': 'request is rejected by grpc RateLimiter middleware, please '
                                                                                              'retry later'},
                                                                                  {'message': 'wait for flush timeout'}]}}]},
            'run_id': 2024091006944126,
            'datetime': '2024-09-10 03:58:14.896097',
            'client_version': '2.5.0'},
 'result': {'test_result': {'index': {'RT': 163.2507,
                                      'sparse_float_vector': {'RT': 2.0226},
                                      'int8_1': {'RT': 0.5125},
                                      'int16_1': {'RT': 12.0945},
                                      'int32_1': {'RT': 0.5128},
                                      'int64_1': {'RT': 1.0224},
                                      'varchar_1': {'RT': 0.5131},
                                      'bool_1': {'RT': 0.5117},
                                      'array_int8_1': {'RT': 0.5106},
                                      'array_int16_1': {'RT': 0.5156},
                                      'array_int32_1': {'RT': 0.512},
                                      'array_int64_1': {'RT': 0.512},
                                      'array_varchar_1': {'RT': 0.5108},
                                      'array_bool_1': {'RT': 0.5117}},
                            'insert': {'total_time': 178.2534, 'VPS': 11219.9823, 'batch_time': 0.4456, 'batch': 5000},
                            'flush': {'RT': 3.0197},
                            'load': {'RT': 4.2674},
                            'Locust': {'Aggregated': {'Requests': 3830,
                                                      'Fails': 1520,
                                                      'RPS': 0.35,
                                                      'fail_s': 0.4,
                                                      'RT_max': 668524.17,
                                                      'RT_avg': 22206.06,
                                                      'TP50': 6,
                                                      'TP99': 439000.0},
                                       'delete': {'Requests': 541,
                                                  'Fails': 0,
                                                  'RPS': 0.05,
                                                  'fail_s': 0.0,
                                                  'RT_max': 107954.74,
                                                  'RT_avg': 8094.23,
                                                  'TP50': 7,
                                                  'TP99': 60000.0},
                                       'flush': {'Requests': 573,
                                                 'Fails': 0,
                                                 'RPS': 0.05,
                                                 'fail_s': 0.0,
                                                 'RT_max': 668524.17,
                                                 'RT_avg': 116451.01,
                                                 'TP50': 57000.0,
                                                 'TP99': 611000.0},
                                       'hybrid_search': {'Requests': 547,
                                                         'Fails': 529,
                                                         'RPS': 0.05,
                                                         'fail_s': 0.97,
                                                         'RT_max': 92795.97,
                                                         'RT_avg': 236.31,
                                                         'TP50': 0,
                                                         'TP99': 2400.0},
                                       'insert': {'Requests': 516,
                                                  'Fails': 0,
                                                  'RPS': 0.05,
                                                  'fail_s': 0.0,
                                                  'RT_max': 71060.58,
                                                  'RT_avg': 8734.32,
                                                  'TP50': 26,
                                                  'TP99': 60000.0},
                                       'load': {'Requests': 562,
                                                'Fails': 0,
                                                'RPS': 0.05,
                                                'fail_s': 0.0,
                                                'RT_max': 119987.21,
                                                'RT_avg': 15715.3,
                                                'TP50': 41,
                                                'TP99': 107000.0},
                                       'query': {'Requests': 542,
                                                 'Fails': 496,
                                                 'RPS': 0.05,
                                                 'fail_s': 0.92,
                                                 'RT_max': 78990.47,
                                                 'RT_avg': 389.64,
                                                 'TP50': 0,
                                                 'TP99': 12000.0},
                                       'search': {'Requests': 549,
                                                  'Fails': 495,
                                                  'RPS': 0.05,
                                                  'fail_s': 0.9,
                                                  'RT_max': 51179.59,
                                                  'RT_avg': 481.69,
                                                  'TP50': 0,
                                                  'TP99': 31000.0}}}}}

Expected Behavior

No response

Steps To Reproduce

concurrent test and calculation of RT and QPS

        :purpose:  `primary key: INT64 autoID`
            1. building `BITMAP` index on all supported 12 scalar fields
            2. 2 fields of different vector types
            3. verify DQL & DML requests

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim
                'sparse_float_vector': sparse_range=[1, 100] <- the range of non-zero values of a sparse vector
                'id': primary key type is INT64

                all scalar fields: varchar max_length=100, array max_capacity=13
            2. build indexes:
                IVF_SQ8: 'float_vector'
                SPARSE_WAND: 'sparse_float_vector'
                BITMAP: all scalar fields
            3. insert 2 million data
            4. flush collection
            5. build indexes again using the same params
            6. load collection
            7. concurrent request:
                - search
                - query
                - hybrid_search
                - load
                - insert
                - delete: delete all inserted data
                - flush: ignore RateLimiter

Milvus Log

No response

Anything else?

No response

@wangting0128 wangting0128 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 11, 2024
@wangting0128 wangting0128 added this to the 2.5.0 milestone Sep 11, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 12, 2024
@yanliang567 yanliang567 removed their assignment Sep 12, 2024
@xiaofan-luan
Collaborator

image

The CPU is fully utilized and requests take hours.
I think a timeout is fine here: any system pushed beyond its capacity will time out.

@xiaofan-luan
Collaborator

As long as the service didn't crash, I think it's fine.

@wangting0128
Contributor Author

wangting0128 commented Sep 14, 2024

As long as the service didn't crash, I think it's fine.

The average DQL latency is normally < 500 ms. During the period when DQL requests were timing out (60 s timeout), the CPU was not fully utilized, so I think this is a problem that needs to be investigated. 🤔️

d0bf967e-236a-4cd7-9774-18dab1740562

@wangting0128
Contributor Author

Only 2M of data was inserted, but the Queryable Entity Num showed 44.8M, and the memory increased from 5G to 57+G.

Screenshot 2024-09-14 10 50 56 Screenshot 2024-09-14 10 50 46

@xiaofan-luan
Collaborator

Only 2M of data was inserted, but the Queryable Entity Num showed 44.8M, and the memory increased from 5G to 57+G.

Screenshot 2024-09-14 10 50 56 Screenshot 2024-09-14 10 50 46

How did you determine that only 2M was inserted? It seems that you have both delete and insert in the test. I think most of the 44.8M entities have been deleted but not compacted in time. Is this what you are trying to test: whether compaction can catch up with deletes?

@wangting0128
Contributor Author

Only 2M of data was inserted, but the Queryable Entity Num showed 44.8M, and the memory increased from 5G to 57+G.
Screenshot 2024-09-14 10 50 56 Screenshot 2024-09-14 10 50 46

How did you determine that only 2M was inserted? It seems that you have both delete and insert in the test. I think most of the 44.8M entities have been deleted but not compacted in time. Is this what you are trying to test: whether compaction can catch up with deletes?

  1. 2M entities were inserted in the preparation phase, and 516 insert requests were performed in the concurrent test phase, each inserting 10 entities. Insert IDs were incremented starting from 2000000, so the total number of entities inserted was 2M + 5160 = 2005160.
  2. Delete was performed 541 times, deleting 10 entities each time, and the deleted IDs were the IDs of the inserted data. When the number of deletions exceeded the number of insertions, IDs 0 to 9 were used to fill the gap, so the visible data was ~2M.
  3. This case verifies concurrent DQL with low DML: it deletes the inserted incremental data and verifies compaction.
    b739234b-0bb2-4b3d-814d-bf00c41195d4
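
The counts in the list above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and are not taken from the test code.

```python
# Figures quoted in the comment above; the names are illustrative only.
PREPARED_ENTITIES = 2_000_000   # inserted in the preparation phase
INSERT_REQUESTS = 516           # insert requests in the concurrent phase
DELETE_REQUESTS = 541           # delete requests in the concurrent phase
BATCH_SIZE = 10                 # entities per insert or delete request

incremental_inserts = INSERT_REQUESTS * BATCH_SIZE
total_inserted = PREPARED_ENTITIES + incremental_inserts
delete_targets = DELETE_REQUESTS * BATCH_SIZE

print(incremental_inserts)  # 5160
print(total_inserted)       # 2005160
print(delete_targets)       # 5410 IDs targeted by delete requests
```

Whichever way the delete overlap is counted, the expected visible entity count stays close to 2M, far below the 44.8M reported by Queryable Entity Num.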

@XuanYang-cn
Contributor

Here's what I see: search timeout (but the task cannot be cancelled) -> pinned segments -> memory and segment count rising

  1. Did DataNode process normally? Yes, the L0 segments and L1 segments were kept within a safe range.
    image
  2. Did QueryNode exchange targets normally? Yes, the target keeps up with the DataNode (DN) processing speed.
    image
  3. Why did QueryNode load so many more segments in memory? They are pinned in memory, waiting for the submitted search/query tasks to finish.
    image
    Offline segment tasks are queued, waiting for the searches to finish.
    image

@XuanYang-cn
Contributor

This is how search works: once tasks are submitted to C++, if the Go side times out and returns after about 1 min, the C++ part continues to run.

  • On timeout, from the user's point of view, the request simply times out after 1 min and returns
  • On timeout, from QueryNode's point of view, the task simply keeps executing until it finishes, which took 1+ hours

CPU usage drops once all search/query tasks in C++ have finished. In the meantime, some search/query requests completed during that short window.
The rest of the time, search/query requests simply failed with a timeout.
image
image
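
The gap between the caller's view and the worker's view can be reproduced in miniature. The sketch below is not Milvus code: a Python thread pool stands in for the Go-to-C++ hand-off, and `long_running_task` and `call_with_timeout` are hypothetical names. It only shows that a timeout on the waiting side does not stop work that has already started.

```python
import concurrent.futures
import threading
import time

finished = threading.Event()

def long_running_task(duration_s: float) -> str:
    """Stand-in for work that cannot be interrupted once started."""
    time.sleep(duration_s)
    finished.set()
    return "done"

def call_with_timeout(duration_s: float, timeout_s: float) -> str:
    """Wait at most timeout_s for the task; the task itself is not cancelled."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(long_running_task, duration_s)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "timeout"
    finally:
        # Do not block here: the worker keeps running in the background.
        executor.shutdown(wait=False)

status = call_with_timeout(duration_s=0.5, timeout_s=0.1)
still_running_after_timeout = not finished.is_set()
finished.wait(timeout=5)  # the work completes later anyway
completed_eventually = finished.is_set()
print(status, still_running_after_timeout, completed_eventually)
```

The caller sees a timeout after 0.1 s, while the worker keeps holding its resources until it finishes on its own.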

@XuanYang-cn
Contributor

The behavior is expected; nothing is abnormal, except that perhaps we need smaller search tasks that take less than 1 hour.

Also, we might need to be able to cancel C++ tasks from the Go side to avoid such long-lived pins.
The memory status of QueryNode looks fragile, which means long search tasks could easily exhaust QueryNode's memory, triggering write limiting or even OOM.

/unassign
/assign @wangting0128
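
One common way to make long tasks cancellable is cooperative cancellation: the work is split into small units and a cancel flag is checked between them. The sketch below illustrates that general idea in Python under stated assumptions; it is not how Milvus implements or plans to implement cancellation, and `cancellable_scan` is a hypothetical name.

```python
import threading
import time

def cancellable_scan(num_chunks: int, cancel: threading.Event) -> int:
    """Process chunks one by one, checking the cancel flag between chunks."""
    processed = 0
    for _ in range(num_chunks):
        if cancel.is_set():
            break
        time.sleep(0.01)  # stand-in for one small unit of work
        processed += 1
    return processed

cancel = threading.Event()
result = {}

worker = threading.Thread(
    target=lambda: result.update(processed=cancellable_scan(1000, cancel))
)
worker.start()
time.sleep(0.05)  # the caller's deadline expires
cancel.set()      # propagate the cancellation to the worker
worker.join(timeout=5)

print(result["processed"] < 1000)
```

The smaller the unit of work between checks, the sooner a cancelled task releases whatever it has pinned.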

@XuanYang-cn
Contributor

@wangting0128 Judging from the metrics, your tests most likely consist of 99% VERY LONG DQL requests and 1% quick DQL requests.

@wangting0128
Contributor Author

The behavior is expected; nothing is abnormal, except that perhaps we need smaller search tasks that take less than 1 hour.

Also, we might need to be able to cancel C++ tasks from the Go side to avoid such long-lived pins. The memory status of QueryNode looks fragile, which means long search tasks could easily exhaust QueryNode's memory, triggering write limiting or even OOM.

/unassign /assign @wangting0128

image
With 2M entities and each DQL request fetching only 10 entities, a request takes 1 hour. Is this reasonable?

@XuanYang-cn
Contributor

Doesn't seem reasonable, I'll look into this.

@zhagnlu
Contributor

zhagnlu commented Nov 5, 2024

image
image
Currently, for an expression that compares two columns, such as A < B, if A and B both have an index, the raw data has to be looked up in reverse from the index one row at a time, which is slow.
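
To see why a per-row reverse lookup is expensive, consider a toy model of a bitmap-style index. This is an illustration only, not Milvus's data structure: Python sets stand in for bitmaps, and all function names are hypothetical. Recovering the raw value of one row means probing the row set of every distinct value, so a column-to-column compare pays that cost twice per row.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to the set of row ids holding it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return dict(index)

def reverse_lookup(index, row_id):
    """Recover the raw value of one row by probing every value's row set."""
    for value, rows in index.items():
        if row_id in rows:
            return value
    raise KeyError(row_id)

def compare_via_index(index_a, index_b, num_rows):
    """Evaluate A < B row by row using only the two indexes."""
    return [
        row_id
        for row_id in range(num_rows)
        if reverse_lookup(index_a, row_id) < reverse_lookup(index_b, row_id)
    ]

def compare_raw(col_a, col_b):
    """Evaluate A < B directly on the raw columns."""
    return [i for i, (a, b) in enumerate(zip(col_a, col_b)) if a < b]

col_a = [3, 1, 4, 1, 5, 9, 2, 6]
col_b = [2, 7, 1, 8, 2, 8, 1, 8]
idx_a, idx_b = build_bitmap_index(col_a), build_bitmap_index(col_b)
print(compare_via_index(idx_a, idx_b, len(col_a)))  # [1, 3, 7]
print(compare_raw(col_a, col_b))                    # [1, 3, 7]
```

Both paths return the same rows, but the index path does work proportional to the number of distinct values for every row, while the raw path does constant work per row.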

@wangting0128
Contributor Author

Verified scalar field comparison
argo task: fouramf-9j5lj-query-expr-3

Scalar fields without an index

[2024-11-05 07:48:21,400 -  INFO - fouram]: [Base] expr of query: "int16_1 == int8_1", kwargs:{'limit': 10000} (base.py:548)
[2024-11-05 07:48:21,447 -  INFO - fouram]: [Time] Collection.query run in 0.0464s (api_request.py:49)

Scalar fields with an INVERTED index

[2024-11-05 07:58:24,558 -  INFO - fouram]: [Base] expr of query: "int16_inverted == int8_inverted", kwargs:{'limit': 10000} (base.py:548)
[2024-11-05 07:58:24,629 -  INFO - fouram]: [Time] Collection.query run in 0.0703s (api_request.py:49)

Scalar fields with a BITMAP index

[2024-11-05 07:48:24,825 -  INFO - fouram]: [Base] expr of query: "int16_bitmap == int8_bitmap", kwargs:{'limit': 10000} (base.py:548)
[2024-11-05 07:48:29,795 -  INFO - fouram]: [Time] Collection.query run in 4.9695s (api_request.py:49)
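
For a quick side-by-side view, the three timings above can be expressed relative to the no-index run. The snippet only restates the logged numbers; the dictionary keys are labels chosen here for readability.

```python
# Query latencies in seconds, copied from the log lines above.
latency_s = {
    "no index": 0.0464,
    "INVERTED": 0.0703,
    "BITMAP": 4.9695,
}

baseline = latency_s["no index"]
for name, seconds in latency_s.items():
    print(f"{name}: {seconds:.4f}s, {seconds / baseline:.1f}x the no-index run")
```

On these single measurements, the BITMAP case is roughly 107 times slower than the no-index case, while INVERTED is about 1.5 times slower.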

@xiaofan-luan
Collaborator

@zhagnlu

Maybe we should change the bitmap index to hasRawData = false?

@xiaofan-luan
Collaborator

Also, for compare, if one column has an index, we could change to iterating over one column row by row and checking the other.

But anyway, compare should not be a very important op for us to spend time on.

@zhagnlu
Contributor

zhagnlu commented Nov 21, 2024

@zhagnlu

Maybe we should change the bitmap index to hasRawData = false?

If changed to false, the raw data will be loaded again, which increases memory cost and is a heavy operation. I think compare and reverse lookup are not frequently used operations. If the latency is too high, the cache param
image
can be used to accelerate retrieval.

@xiaofan-luan
Collaborator

We need to add a new attribute on the index, retrieve_enabled.
For a bitset index, if retrieve_enabled is set, we'd better load the raw data.
The same applies to quantized vectors.

@yanliang567 yanliang567 modified the milestones: 2.5.0, 2.5.1 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.5.1, 2.5.2 Dec 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.5.2, 2.5.3 Jan 6, 2025
@yanliang567 yanliang567 modified the milestones: 2.5.3, 2.5.4 Jan 16, 2025
@yanliang567 yanliang567 modified the milestones: 2.5.4, 2.5.5 Jan 24, 2025

stale bot commented Feb 23, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Feb 23, 2025