An effective method to use centralized Q-learning in multi-robot task allocation

Abstract The use of Q-learning methods in multi-robot systems is challenging. Multi-robot systems are dynamic and partially observable because each robot has independent decision-making and acting mechanisms, whereas Q-learning is theoretically defined on Markovian environments. One way to apply Q-learning in multi-robot systems is centralized learning, which learns optimal Q-values over the state space of the overall system and the joint action space of all agents. In this case, the system can be considered stationary and convergence to optimal solutions is possible. However, centralized learning requires full knowledge of the environment, perfect inter-robot communication, and high computational power. Especially for large systems, the computational cost becomes huge because the learning space grows exponentially with the number of robots. The approach proposed in this study, subG-CQL, divides the overall system into small sub-groups without adversely affecting the system's task-performing abilities. Each sub-group consists of fewer robots performing fewer tasks and learns in a centralized manner for its own team. Thus, the learning space dimension is reduced to a reasonable level and the required communication remains limited to the robots in the same sub-group. Because centralized learning is used, successful results are expected. Experimental studies show that the proposed algorithm improves the task assignment performance of the system and the efficient use of system resources.


Introduction
With the rapid growth of technology, multi-robot systems (MRS) have become popular, especially for complex applications. An MRS can execute tasks faster because team members run simultaneously. An MRS is highly fault-tolerant: when one robot goes out of service, the others take over its role. It also has distributed sensing and acting facilities, which provide a wide working area and fast, flexible execution [1]. An MRS environment is partially observable and dynamic in nature [2]. The robots in an MRS operate with their own local sensing, and each has its own decision-making and acting mechanisms. Moreover, robot interaction and information sharing are complicated by noisy and insufficient communication [3]. These factors explain why precise and accurate coordination must be provided in MRS [4].
Multi-robot task allocation (MRTA) is the process of ensuring that robots do the appropriate tasks at the right time and in an appropriate order [5]. MRTA plays a key role in achieving the necessary coordination and optimizing system performance. One widely used approach to MRTA problems is auction protocols, a special kind of market-based approach [6]. Auction-based task allocation approaches have the advantages of implementation simplicity and of combining distributed planning with centralized decision-making [7]. Distributed mapping [8], multi-robot box pushing [6], and multi-robot path planning [9] are some examples of auction-based multi-robot coordination. In auction protocols, tasks are treated as items and robots as customers. Tasks are announced by an auctioneer robot with a base price representing their costs. Customer robots calculate the cost of the announced tasks according to their own capabilities and send bids to the auctioneer. For mobile robots, the cost of a task may be the travelled distance or the execution time [10]. The auction process ends by allocating the tasks to suitable robots in a manner that maximizes the system gain [11], [12].
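The announce, bid, and allocate cycle described above can be sketched as follows. This is a minimal illustration rather than the protocol of any cited work: the `Bid` class, the `run_auction` function, the greedy lowest-cost rule, and all cost values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    robot: str   # bidder label (hypothetical)
    task: str    # announced task
    cost: float  # bidder's own cost estimate, e.g. travelled distance


def run_auction(tasks, robots, cost_fn):
    """Single-round auction: each task goes to the free robot with the lowest bid.

    cost_fn(robot, task) returns a cost, or None if the robot cannot do the task.
    The greedy rule is an assumption; it does not guarantee a globally optimal
    allocation.
    """
    allocation = {}
    busy = set()
    for task in tasks:
        bids = []
        for robot in robots:
            if robot in busy:
                continue  # a robot that already won a task does not bid again
            cost = cost_fn(robot, task)
            if cost is not None:
                bids.append(Bid(robot, task, cost))
        if not bids:
            continue  # nobody can take the task in this round
        winner = min(bids, key=lambda b: b.cost)
        allocation[task] = winner.robot
        busy.add(winner.robot)
    return allocation


# Hypothetical costs; a missing entry means the robot cannot perform the task.
COSTS = {("R1", "T1"): 4.0, ("R2", "T1"): 2.5, ("R1", "T2"): 3.0}

result = run_auction(["T1", "T2"], ["R1", "R2"], lambda r, t: COSTS.get((r, t)))
print(result)
```

In this toy run, R2 wins T1 with the lower bid and R1, the only remaining capable robot, receives T2.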
In MRS, perfect coordination cannot be guaranteed by traditional means because the environment is partially observable and uncertain [13]. Tasks generally arise at unpredictable instants and in an unpredictable sequence while the system is running. For this reason, the assignment of tasks to robots cannot be pre-planned; task allocation needs to be performed instantaneously as the tasks appear [14]. Moreover, each robot has independent sensing, decision-making, and acting mechanisms, which prevents robots from predicting others' behaviors. Efficient system coordination is only possible if robots can adapt to changing environmental conditions. By giving the system learning skills, robots can overcome unpredicted and uncertain situations [15]. As an example of learning-based coordination, robots use their past task allocation experiences when bidding for future tasks [16]. In [17], a learning-based approach is used to reason about future task allocation. In [18], robots learn bid values and use them successfully in the auction process for underwater exploration, a dynamic environment with a high level of uncertainty. Efficient solutions have been obtained by using reinforcement learning for dynamic task allocation problems in fire-disaster response [19], [20].
Q-learning (QL) is a value-function-based, model-free reinforcement learning method [21]. It learns optimal Q-values for each state-action pair in tabular form [4]. QL is suitable for complex applications such as robotic systems because it does not require a model of the environment. However, QL is theoretically defined for Markov decision process (MDP) environments, so applying it to MRS is problematic because of their dynamic and partially observable characteristics [22]. One way to use QL in MRS is the distributed learning approach, in which the robots learn only for their own state-action pairs. Although it is easy to implement, the environment is no longer stationary, which contradicts the MDP requirement. Therefore, distributed learning does not guarantee reaching an optimal solution. The other way is centralized learning, which works on joint state and joint action spaces. This requires perfect inter-agent communication. In centralized learning, the joint action space dimension increases exponentially with the number of robots. Especially for large systems, the huge learning space dimension causes computational and implementation difficulties [23].
In this study, a new method, the subG-CQL algorithm, is proposed to overcome the problems encountered in using QL in MRS. The main goal of this approach is to divide the system into smaller sub-groups by allowing some robots to concentrate on specific tasks without adversely affecting the system. The subG-CQL algorithm has a positive all-round impact on system performance. Robots can be used more efficiently because they deal with a smaller variety of tasks. Learning is carried out in a centralized manner within these sub-groups, which are completely independent of each other. Thus, the convergence advantage of centralized learning can be exploited. In addition, the scalability problem arising from the large learning dimension, which is the biggest problem of centralized learning, is addressed. The comparative results show that the proposed method improves system performance.
The paper is organized as follows. Section 2 briefly gives the basics of Q-learning for single-agent and multi-agent cases. Section 3 examines the problem handled in this study. The proposed approach is explained and the algorithm is given in Section 4. Section 5 covers application details such as system structure and performance metrics. Experimental results and comments are presented in Section 6. Section 7 concludes the paper.

Q-Learning basics
Reinforcement learning is a class of machine learning techniques that need neither a supervisor nor a system model [24]. Learning is based on feedback that measures the change in the environment state resulting from the agent's action. In reinforcement learning theory, this feedback is called the reward. If the agent's action changes the state as desired, the reward takes a value that reinforces this action. Otherwise, the reward takes the form of a penalty that discourages this action. In short, reinforcement learning techniques are trial-and-error methods. Because they do not require any prior information about the system, reinforcement learning methods are a good learning approach, especially for complex environments [21].
Q-learning (QL) is a reinforcement learning method based on the value function approach. In QL, proposed in [22], an agent learns the Q-values of each state-action pair by using the reward received as feedback on the effect of its actions on the environment state. Theoretical details of QL for both single-agent and multi-agent cases are given below.

Single agent Q-Learning
Single-agent Q-learning is defined on an environment modeled as a Markov decision process (MDP). An MDP is a tuple $\langle S, A, f, \rho \rangle$. Here, $S$ is the discrete and finite set of environment states, $A$ is the discrete and finite set of agent actions, $f: S \times A \times S \to [0,1]$ is the probabilistic state transition function, and $\rho: S \times A \times S \to \mathbb{R}$ is the real-valued reward function [19].
At step $t$, the agent takes action $a(t) \in A$, and the environment state $s(t) \in S$ switches to $s(t+1) \in S$. The agent receives the reward $r(t) = \rho(s(t), a(t), s(t+1))$ as feedback on the effect of its action on the environment [24]. The agent's action $a(t)$ at state $s(t)$ is determined by its action policy $h$. In an MDP, the agent has a deterministic, stationary, and optimal action policy [21]. At each step, the action policy $h$ leads the agent to select the action that maximizes the expected value of the overall gain. The action-value function $Q^{h}: S \times A \to \mathbb{R}$ gives the expected total gain of each state-action pair under the action policy. It is the expected discounted sum of all future rewards, as expressed in (1), where $\gamma$ is the discount factor.
$$Q^{h}(s,a) = E\left\{ \sum_{k=0}^{\infty} \gamma^{k}\, r(t+k) \,\middle|\, s(t)=s,\ a(t)=a,\ h \right\} \tag{1}$$
The optimal action-value function is the $Q^{*}$-function given in equation (2), and it satisfies the Bellman optimality equation [25].
$$Q^{*}(s,a) = \max_{h} Q^{h}(s,a), \quad \forall s \in S,\ \forall a \in A \tag{2}$$
Q-learning is a value-function-based, model-free reinforcement learning method [26]. In Q-learning, the optimal Q-values of each state-action pair are learned iteratively by equation (3), where $\gamma$ is the discount factor and $\alpha$ is the learning rate [27].
$$Q(s(t),a(t)) \leftarrow Q(s(t),a(t)) + \alpha \left[ r(t) + \gamma \max_{a'} Q(s(t+1),a') - Q(s(t),a(t)) \right] \tag{3}$$
This equation needs neither an environment model nor the probabilistic state transition function. If the update is repeated infinitely many times for each state-action pair and $\alpha$ is appropriately decreased at each step, the learned Q-values converge to the optimal ones [22].
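The iterative update referred to as equation (3) can be illustrated with a small tabular sketch. It uses the standard Q-learning rule; the function name `q_update`, the state and action labels, and the values of the learning rate and discount factor are assumptions chosen for the example.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(s, a).

    Q is a mapping from (state, action) pairs to values; unseen pairs
    default to 0.0. alpha is the learning rate, gamma the discount factor.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) toward the one-step bootstrapped target.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)          # all Q-values start at zero
actions = ["bid", "wait"]       # hypothetical action labels
new_value = q_update(Q, s="idle", a="bid", r=1.0, s_next="busy", actions=actions)
print(new_value)
```

Note that no transition model appears anywhere in the update: only the observed reward and next state are used, which is what makes the method model-free.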

Multi agent Q-Learning
A stochastic game (SG) is defined as the tuple $\langle S, A, f, \rho \rangle$, where $S$ is the finite and discrete set of environment states and $A = A_1 \times A_2 \times \dots \times A_n$ is the joint action set of all $n$ agents. $f: S \times A \times S \to [0,1]$ is the state transition function defined for each state and joint action pair, and $\rho_i: S \times A \times S \to \mathbb{R}$, $i = 1, \dots, n$, is the reward function of each agent [27]. With this definition, an SG can be regarded as a generalization of the MDP. In an SG, state transitions result from the joint actions of all agents.
A Nash equilibrium is a joint action policy in which each agent's action policy yields the maximum total reward against the other agents' action policies [21]. In a Nash equilibrium, no agent can improve its total reward by changing its own action policy while all other agents' action policies are kept the same. The Nash-Q-learning algorithm is a multi-agent Q-learning method that aims to reach a Nash equilibrium [22]. For agent $i$, the Q-values are updated according to equation (4) by using joint actions.
In equation (4), $h$ denotes the Nash equilibrium policy of all agents.
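For orientation, the Nash-Q update as it is commonly stated in the literature is reproduced below. The notation is an assumption and may differ from the form used in equation (4): $\mathbf{a}(t)$ is the joint action, $r_i(t)$ is the reward of agent $i$, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\mathrm{Nash}Q_i(s')$ is the payoff of agent $i$ at state $s'$ when all agents follow a Nash equilibrium of the stage game defined by the current Q-values.

```latex
Q_i\bigl(s(t), \mathbf{a}(t)\bigr) \leftarrow
  (1-\alpha)\, Q_i\bigl(s(t), \mathbf{a}(t)\bigr)
  + \alpha \Bigl[ r_i(t) + \gamma\, \mathrm{Nash}Q_i\bigl(s(t+1)\bigr) \Bigr]
```

Compared with the single-agent rule, the maximization over the agent's own next action is replaced by the equilibrium value of the next state, and the table is indexed by joint actions.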

Problem statement
In most real-world MRS applications, the working environment is partially observable and dynamic because of noisy sensor measurements, limited communication, and the unpredictable effects of agents' actions [3]. These properties violate the assumptions of reinforcement learning theory and explain why traditional QL algorithms cannot reach an optimal solution in MRS coordination [22].
One approach to applying QL in MRS is a decentralized learning structure. In decentralized learning, each robot learns Q-values only for its own states and actions by directly applying the QL rules defined for the single-agent case [30]. Robots are not concerned with the results of other robots' actions. Decentralized learning is therefore simple to run and does not need inter-robot communication. The dimension of the learning space, which consists of individual state and action spaces, is small, and the computational cost is low [30]. On the other hand, the MRS environment is no longer stationary because of the robots' independent actions, which contradicts QL theory. Since robots learn individually without considering the decisions of others, behavioral conflicts are inevitable [30]. This is the major reason why optimal solutions are not reached [22]. Nevertheless, decentralized learning is preferred in many applications because it is easy to implement and its learning space dimension is small. Successful results have been achieved for small environments under some constraints [31]. Independent Q-learning (IQL) is an example with a high degree of decentralization [32]. Empirical results show that IQL works well only in simple applications [33]. Hyper-Q learning tries to solve the non-stationarity problem by observing other agents' actions [34]. In [35], coordination graphs are used to estimate global Q-values.
In centralized Q-learning, another approach to multi-agent QL, robots learn global Q-values using the joint actions of the whole team. Since the joint actions and joint states of all team members are considered, the MRS environment can be treated as an MDP, and convergence to the optimal solution is expected [22]. Centralized learning requires full knowledge of all robots' actions and of all possible states, as well as perfect communication among robots. Deficiencies in these factors prevent the desired performance from being achieved [2]. Furthermore, the dimension of the joint action space grows exponentially with the number of robots. This means that the learning space becomes huge and the computational complexity increases enormously for large MRS [28]. Although centralized learning promises to reach the optimal solution, it is very difficult to implement.
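The exponential growth of the joint action space can be made concrete with a short sketch. The action labels and the choice of two or five actions per robot are illustrative assumptions, not values from the experiments.

```python
from itertools import product
from math import prod


def joint_action_space(action_sets):
    """Return every joint action as a tuple with one entry per robot."""
    return list(product(*action_sets))


def joint_action_count(action_set_sizes):
    """Size of the joint action space: the product of the individual sizes."""
    return prod(action_set_sizes)


# Hypothetical example: three robots, each choosing between two actions.
three_robots = [["bid", "wait"]] * 3
joint = joint_action_space(three_robots)
print(len(joint))  # 2 * 2 * 2 joint actions

# Growth with team size when every robot has five actions (assumed value).
for n in (2, 4, 6, 8, 10):
    print(n, joint_action_count([5] * n))
```

With five actions per robot, going from two to ten robots raises the number of joint actions from 25 to 9,765,625, and each of them needs its own Q-value for every joint state.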
Both decentralized and centralized learning approaches involve trade-offs. In many studies, hybrid learning schemes are proposed to combine their advantages and discard their problematic points. In these schemes, learning is generally carried out in a distributed manner, but an external coordination mechanism is used to obtain a global solution. In the modular Q-learning approach, robots learn their own Q-values and a control unit resolves behavior conflicts [36]. The sequential Q-learning algorithm has the agents learn independently in a predetermined order: each agent observes the others' actions and then learns for its own state-action pairs [37]. In the CTDE algorithm, the learning process is run in a distributed manner with full knowledge of the environment states [38]. VDN [39] and QMIX [40] are examples in which joint action-value functions are factorized into individual ones, so the learning space scalability problem is minimized. However, they are applicable only to systems that have at least one optimal solution [41].
In this study, the subG-CQL algorithm is proposed to make QL easy to use in MRS. The subG-CQL algorithm splits the whole system into sub-groups that are independent of each other. Each sub-group behaves as a small system, and together they constitute the overall system. The subG-CQL algorithm forces robots to refuse some of their task types. It aims to match the tasks done by a large number of robots to the robots able to perform fewer task types. Thus, robots that carry out a large number of task types drop some of them. As a result, each robot performs fewer task types and each task type is done by fewer robots. In the end, the robots performing the same task types compose a sub-group. For each sub-group, learning can easily be carried out in a centralized manner because of its small size. The details of the algorithm are given in Section 4.

Proposed approach: subG-CQL algorithm
Let a heterogeneous MRS consist of robots with different skills, $R_i$, $i = 1, \dots, n$. These robots are responsible for fulfilling different types of tasks $T_j$, $j = 1, \dots, m$. In a heterogeneous MRS, each robot is capable of some tasks, not all. In addition, some robots are better suited to certain tasks because of their physical structure. Each robot may be able to perform a different number of task types. Robot $R_i$ has a robot task set (RTS), $\Gamma_i$, which is the list of tasks that can be executed by $R_i$, as in (5).
$w_{ij}$ is the willingness parameter of $R_i$ to do $T_j$.
The robot-task relation matrix (RTM) carries information about the robots $R_i$ and their sets $\Gamma_i$, $i = 1, \dots, n$, and it is defined as in equation (6).
The row-sum of RTM, $\sum_{j=1}^{m} RTM_{ij}$, is the number of tasks that can be done by $R_i$, and it is equal to the size of $\Gamma_i$. The column-sum of RTM, $\sum_{i=1}^{n} RTM_{ij}$, is the number of robots that can perform task $T_j$.
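The row-sums and column-sums can be illustrated on the six-robot, five-task System-I reported in the experimental part of the paper. The sketch assumes binary RTM entries (1 if the robot can perform the task, 0 otherwise), which is consistent with the row-sum being the size of the robot task set. The labels R1 to R6 and T1 to T5 follow the row order of Table 1 and are assumptions, as are the variable names.

```python
# Robot task sets of System-I, read from Table 1 of the paper.
TASKS = ["T1", "T2", "T3", "T4", "T5"]
RTS = {
    "R1": ["T1", "T4", "T5"],
    "R2": ["T2", "T4"],
    "R3": ["T1", "T3"],
    "R4": ["T4", "T5"],
    "R5": ["T1"],
    "R6": ["T1", "T3", "T5"],
}

# Binary robot-task relation matrix: one row per robot, one column per task.
rtm = [[1 if task in RTS[robot] else 0 for task in TASKS] for robot in RTS]

# Row-sum: number of task types a robot can perform (size of its task set).
row_sums = {robot: sum(row) for robot, row in zip(RTS, rtm)}

# Column-sum: number of robots able to perform a task type.
col_sums = {task: sum(row[j] for row in rtm) for j, task in enumerate(TASKS)}

print(row_sums)
print(col_sums)
```

Here T1 has the largest column-sum (four robots) and R5 the smallest row-sum (one task type).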
The subG-CQL algorithm proposed in this study aims to divide the system into sub-groups, each with fewer robots performing a limited number of tasks. The idea behind the subG-CQL algorithm is that the tasks executed by a large number of robots are matched to the robots responsible for fewer task types. Thus, the variety of tasks that robots have to do decreases, and the robots performing the same types of tasks form small groups. In this way, the robots can concentrate on fewer task types.
Initially, there are no sub-groups; all robots form a single team. At each iteration, the algorithm selects the task performed by the maximum number of robots, that is, the task $T_j$ whose RTM column has the highest column-sum. This task is matched to the robot $R_i$ with the lowest row-sum. If task $T_j$ or robot $R_i$ is already present in one of the existing sub-groups, the triplet $\langle R_i, T_j, d_{i,j} \rangle$ is added to that sub-group. If not, a new sub-group is created with this triplet as its first element.
$d_{i,j}$ represents the degree of the $R_i$-$T_j$ pair and is calculated as in (7). If more than one task has the same column-sum, the algorithm consults the column weight, which measures how many robots do these tasks. The task with the lowest column weight is selected. The column weight is calculated by using equation (8).
If more than one row has the minimum row-sum, the robot with the highest willingness parameter is matched to the task.
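A single selection step of this matching can be sketched as follows. This is a simplified illustration, not the full algorithm: the tie-breaking rules based on the column weight and the willingness parameter are omitted (ties fall to the first candidate encountered), the robot is chosen among those able to perform the selected task, and the function name `select_pair` and the System-I labels are assumptions.

```python
def select_pair(rts):
    """Pick the most widely shared task and the least loaded robot able to do it.

    rts maps each robot to the list of task types it can perform.
    Ties are resolved by insertion order here, which is a simplification.
    """
    # Column-sums: how many robots can perform each task type.
    col_sums = {}
    for tasks in rts.values():
        for task in tasks:
            col_sums[task] = col_sums.get(task, 0) + 1
    task = max(col_sums, key=col_sums.get)

    # Among robots able to do that task, take the one with the lowest row-sum.
    capable = [robot for robot, tasks in rts.items() if task in tasks]
    robot = min(capable, key=lambda r: len(rts[r]))
    return task, robot


# Robot task sets of System-I as listed in Table 1 (labels assumed).
SYSTEM_I = {
    "R1": ["T1", "T4", "T5"],
    "R2": ["T2", "T4"],
    "R3": ["T1", "T3"],
    "R4": ["T4", "T5"],
    "R5": ["T1"],
    "R6": ["T1", "T3", "T5"],
}

first_match = select_pair(SYSTEM_I)
print(first_match)
```

For System-I the first match under these assumptions pairs T1, the task shared by four robots, with R5, which can perform only that task type.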
Each robot is connected to a sub-group with a dependency parameter, calculated as the sum of its $d_{i,j}$ values within that sub-group. In some cases, the matched $R_i$ and $T_j$ both exist in present sub-groups, but in different ones. Then the triplet $\langle R_i, T_j, d_{i,j} \rangle$ is added to all of them. Such robots, which belong to multiple sub-groups, are stored in a separate set. During the algorithm, the robots' new task sets are constituted. $\Lambda_i$ is the new task set of robot $R_i$, and it contains the task types $T_j$ that are matched to $R_i$, together with their weights $d_{i,j}$.
After the formation of sub-groups is completed, each robot in this set is forced to choose the one sub-group for which its dependency parameter is highest. The triplets related to that robot are removed from the other sub-groups. The new task sets are edited by deleting the task types dropped by the robot. Despite this task-refusing procedure, the subG-CQL algorithm ensures, through its matching strategy, that each task type is performed by enough robots.
At the end of this step, the subG-CQL algorithm is finished. The overall system is divided into sub-groups, each of which acts as an independent small system. In each sub-group, learning is run in a centralized manner with its own state space and the joint action space of the robots in that sub-group. The subG-CQL algorithm is given below in detail.
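The independence of the resulting sub-groups can be checked with a short sketch. It is a consistency check rather than the subG-CQL procedure itself: robots are grouped whenever they share at least one task type, and the input is the set of reduced task sets reported for System-I in the experimental part (R1 without T1 and R6 without T5). The function name `subgroups` and the robot and task labels are assumptions.

```python
def subgroups(task_sets):
    """Group robots that share at least one task type (connected components).

    task_sets maps each robot to its task types. Returns a list of
    (robots, tasks) pairs whose task sets are pairwise disjoint.
    """
    groups = []
    for robot, tasks in task_sets.items():
        robots, merged = {robot}, set(tasks)
        untouched = []
        for g_robots, g_tasks in groups:
            if g_tasks & merged:
                # The new robot links this group to the one being built.
                robots |= g_robots
                merged |= g_tasks
            else:
                untouched.append((g_robots, g_tasks))
        groups = untouched + [(robots, merged)]
    return groups


# System-I before the algorithm (Table 1) and after the reported task drops.
ORIGINAL = {"R1": ["T1", "T4", "T5"], "R2": ["T2", "T4"], "R3": ["T1", "T3"],
            "R4": ["T4", "T5"], "R5": ["T1"], "R6": ["T1", "T3", "T5"]}
REDUCED = {"R1": ["T4", "T5"], "R2": ["T2", "T4"], "R3": ["T1", "T3"],
           "R4": ["T4", "T5"], "R5": ["T1"], "R6": ["T1", "T3"]}

print(len(subgroups(ORIGINAL)))  # every robot is linked through shared tasks
for robots, tasks in subgroups(REDUCED):
    print(sorted(robots), sorted(tasks))
```

Before the drops all six robots are linked through shared tasks; after them, the robots fall into two groups that share neither a robot nor a task type.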

System structure
For the experimental studies, two different MRS are prepared: a small system and a larger, much more complex one.
In System-I, six robots are able to do five different types of tasks. System-I is fully heterogeneous because of the robots' different physical properties and skills. Robots and related tasks in System-I are given in Table 1.

Table 1. Robots and related tasks for System-I.
Robot    Related task types
$R_1$    $T_1$, $T_4$, $T_5$
$R_2$    $T_2$, $T_4$
$R_3$    $T_1$, $T_3$
$R_4$    $T_4$, $T_5$
$R_5$    $T_1$
$R_6$    $T_1$, $T_3$, $T_5$

System-II is like System-I, but it is much bigger and more complex, with ten robots and eight different types of tasks to be done. System-II is also highly heterogeneous. The robot-task relations for System-II are shown in Table 2.
For both systems, all task types are equally probable. Each task type has two priority levels, high-priority and low-priority. High-priority tasks must be done first, and they are more time-consuming than low-priority tasks. High-priority tasks make up 30%-35% of all tasks, and the remainder are low-priority tasks.
Tasks appear at any time and in random sequence during system operation. Task allocation is executed in an auction-based manner. Tasks that are announced but not assigned are auctioned once more after a certain time. If a task is still not assigned, it is dropped from the list. It is assumed that every task allocated to a robot is completed.
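The re-announcement rule can be sketched as a small bookkeeping routine. The function name `auction_round`, the `try_assign` callback, and the task labels are assumptions for illustration; the actual timing of the second announcement is not modeled.

```python
def auction_round(pending, try_assign, max_attempts=2):
    """Announce each pending task once and sort it into one of three lists.

    pending holds (task, failed_attempts) pairs. A task that fails its first
    auction is kept for one more round; after the second failure it is dropped.
    try_assign(task) returns True if some robot wins the task.
    """
    assigned, still_pending, dropped = [], [], []
    for task, attempts in pending:
        if try_assign(task):
            assigned.append(task)
        elif attempts + 1 < max_attempts:
            still_pending.append((task, attempts + 1))
        else:
            dropped.append(task)
    return assigned, still_pending, dropped


# Hypothetical run: only task "A" ever finds a free robot.
can_assign = lambda task: task == "A"
round1 = auction_round([("A", 0), ("B", 0)], can_assign)
round2 = auction_round(round1[1], can_assign)
print(round1)
print(round2)
```

In the example, task "B" survives the first round as pending and is dropped after its second unsuccessful announcement.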

Performance metrics
The learning-based task allocation methods are applied to the systems described above. The experimental results are evaluated in terms of three performance metrics: completed task ratio, idle time ratio, and learning space dimension.
Completed task ratio (CTR) is the ratio of the number of tasks executed to the total number of tasks announced. Because every allocated task is assumed to be completed, the number of assigned tasks is used instead of the number of completed tasks. CTR is expressed in percent.
Idle time ratio (ITR) is the ratio of a robot's free time to its whole execution time. ITR is determined by equation (9).
$$ITR = \frac{t_{total} - t_{busy}}{t_{total}} \times 100 \tag{9}$$
Here, $t_{total}$ is the total execution time of a robot and $t_{busy}$ is the time during which the robot is busy with the auction process, performing a task, or charging.
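Both ratios are simple to compute from logged counts and durations. The sketch below follows the definitions given above; the function names and the sample numbers are assumptions and do not come from the reported experiments.

```python
def completed_task_ratio(assigned_tasks, announced_tasks):
    """CTR in percent: assigned tasks over announced tasks."""
    return 100.0 * assigned_tasks / announced_tasks


def idle_time_ratio(total_time, busy_time):
    """ITR in percent: share of the execution time in which the robot is free.

    busy_time covers auctioning, performing tasks, and charging.
    """
    return 100.0 * (total_time - busy_time) / total_time


# Hypothetical log of one run: 45 of 60 announced tasks were assigned,
# and one robot was busy for 170 of 200 time units.
ctr = completed_task_ratio(45, 60)
itr = idle_time_ratio(200.0, 170.0)
print(ctr, itr)
```

A high CTR together with a low ITR indicates that most announced tasks found a robot and that robots spent little time waiting.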
In the literature, the computational cost of Q-learning is reported to be directly related to the learning space dimension, calculated as the product of the state space dimension and the action space dimension [23], [28]. Let $L$ be the learning space and $|\cdot|$ denote the space dimension. The individual learning space dimension of robot $R_i$ is determined by equation (10), where $S_i$ is the state space of $R_i$ and $A_i$ is the action space of $R_i$.
$$|L_i| = |S_i| \cdot |A_i| \tag{10}$$
In decentralized learning, the learning space dimension of the overall system with $n$ robots is equal to the sum of the individual learning spaces, as given in equation (11).
$$|L_{dec}| = \sum_{i=1}^{n} |S_i| \cdot |A_i| \tag{11}$$
Centralized learning uses joint action and state spaces. The state space of the whole system is the union of the individual state spaces, as in equation (12), and its dimension is given in equation (13).
$$S = \bigcup_{i=1}^{n} S_i \tag{12}$$
$$|S| = \left| \bigcup_{i=1}^{n} S_i \right| \tag{13}$$
The joint action space is the Cartesian product of the individual action spaces, as in equation (14), and its dimension is calculated by equation (15).
$$A = A_1 \times A_2 \times \dots \times A_n \tag{14}$$
$$|A| = \prod_{i=1}^{n} |A_i| \tag{15}$$
In the worst case, the joint action space takes its largest form, $|A| = |A_i|^{n}$, which occurs when the system is fully cooperative. The learning space dimension in centralized learning is then calculated by using equation (16).
$$|L_{cen}| = |S| \cdot |A| \tag{16}$$
In the proposed algorithm, the system is divided into $p$ sub-groups. Each sub-group runs the learning process in a centralized manner, and each acts as if it were an individual agent of the whole system. Therefore, the state space and action space dimensions of each sub-group are determined by equations (13) and (15), respectively, and the learning space of the overall system is calculated by adding up the learning spaces of all sub-groups.
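The three ways of counting the learning space can be compared with a small sketch. All sizes below are hypothetical and are not the values of Table 9: six robots with four states and three actions each, joint state spaces taken as disjoint unions, and two sub-groups of three robots. The function names are also assumptions.

```python
from math import prod


def decentralized_dim(state_sizes, action_sizes):
    """Sum of the individual |S_i| * |A_i| products."""
    return sum(s * a for s, a in zip(state_sizes, action_sizes))


def centralized_dim(joint_state_size, action_sizes):
    """Joint state space size times the product of all action space sizes."""
    return joint_state_size * prod(action_sizes)


def subgroup_dim(groups):
    """Sum of centralized dimensions over sub-groups.

    groups is a list of (joint_state_size, action_sizes) pairs, one per sub-group.
    """
    return sum(centralized_dim(s, a) for s, a in groups)


# Hypothetical system: six robots, each with 4 states and 3 actions.
states, actions = [4] * 6, [3] * 6
dec = decentralized_dim(states, actions)
cen = centralized_dim(sum(states), actions)  # disjoint union assumed
sub = subgroup_dim([(12, [3] * 3), (12, [3] * 3)])
print(dec, sub, cen)
```

Under these assumptions the sub-grouped system needs 648 table entries, between the 72 of fully decentralized learning and the 17,496 of fully centralized learning.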

Experimental results
Experimental studies are carried out on two different systems, whose details are given in Section 5. To emphasize the impact of the proposed subG-CQL algorithm, it is compared with three other methods from the literature: centralized Q-learning, semi-centralized Q-learning, and decentralized Q-learning.
A well-known example of the centralized Q-learning (CQL) method is studied in [28]. The CQL method treats the environment as an SG, which is the primary requirement of multi-agent Q-learning theory. Learning is carried out on the overall state space of the environment and the joint action space of the robots. The robots' main aim is to reach a Nash equilibrium.
In the semi-centralized Q-learning (semi-CQL) method, robots gather information about the states of the whole system and use it to determine their own behavior. However, each robot learns the Q-values of its own state-action pairs, similarly to [38].
In the decentralized Q-learning (DQL) method, each robot acts as an independent learner, as in [32]. The robots run the learning process individually, using their own state and action spaces, and they do not take the behavior of their teammates into account.
The subG-CQL algorithm proposed in this study divides the system into small groups that are independent of each other. Each sub-group behaves as a separate system, and learning is managed in a centralized manner for each of them. Because the system is reduced to small sub-systems, the scalability problem of CQL can be handled. As a result of the subG-CQL algorithm, two sub-groups are formed in System-I. Robot $R_1$ drops task $T_1$ from its RTS. Similarly, robot $R_6$ renounces $T_5$ although it can do this type of task. The sub-group structure of System-I is given in Table 3.

Table 3. Sub-groups formed after subG-CQL for System-I.
Sub-Group-1
Robot    Related task types
$R_1$    $T_4$, $T_5$
$R_2$    $T_2$, $T_4$
$R_4$    $T_4$, $T_5$
Sub-Group-2
Robot    Related task types
$R_3$    $T_1$, $T_3$
$R_5$    $T_1$
$R_6$    $T_1$, $T_3$

The subG-CQL algorithm creates three sub-groups in System-II. Many robots have stopped performing multiple tasks, as shown in Table 4. For example, robot $R_2$ drops tasks $T_5$ and $T_7$; it performs only three types of task after subG-CQL, whereas it originally has five types of task in its RTS.

Sub-Group-3
Related task types
$T_6$, $T_7$
$T_4$, $T_6$
$T_4$, $T_6$, $T_7$
$T_6$, $T_7$

The experimental results for the four methods are analyzed separately below.

Completed task ratio (CTR)
CTR gives the ratio of completed tasks and can be regarded as a measure of the success of the task allocation process. CTR values of all task types for both systems are given in Table 5 and Table 6. CTR values are in percent and are rounded to the nearest integer for simplicity. The CTR values of all approaches are also compared graphically: low-priority and high-priority tasks are shown separately in Figure 1 and Figure 2 for System-I and in Figure 3 and Figure 4 for System-II, respectively. In the graphs, the low-priority and high-priority tasks of each task type $T_j$ are labeled separately.
Experimental results indicate that the CTR values of high-priority tasks are higher than those of low-priority tasks, thanks to auction-based task allocation.
As expected, the CQL algorithm has the highest CTR values for each type of task. In CQL, the environment can be regarded as nearly stationary and free of uncertainty, because the overall state space of the environment and the joint action space of all agents are considered. In contrast, the DQL algorithm yields the lowest CTR values for low-priority tasks, while its CTR values for high-priority tasks are reasonably high. In DQL, robots learn individually over their own state and action spaces. As a result, robots push themselves to do high-priority tasks rather than low-priority ones. The semi-CQL algorithm's CTR values are slightly higher than those of DQL for low-priority tasks and lower than those of DQL for high-priority tasks, but not poor. Semi-CQL gathers information about the overall system states but uses a decentralized learning structure, which explains why its results are so similar to those of DQL.
The subG-CQL algorithm proposed in this study achieves satisfactory CTR values for both high-priority and low-priority tasks, although these values are slightly lower than those of CQL. subG-CQL learns in a centralized manner but at a smaller scale. The whole system is divided into sub-groups that share no task or robot with each other. One advantage of this algorithm is that it makes centralized learning easy to apply within each small sub-group. The other advantage is that it concentrates the robots on a smaller variety of task types. Dealing with the same task types allows many more tasks to be completed, because every task type has different features such as duration and difficulty.

Idle time ratio (ITR)
Robots and their abilities are considered system resources. In most MRS applications, not all tasks can be done because resources are scarce. When the number of robots is insufficient, many announced tasks cannot be assigned because all robots are busy with other tasks during the auction. ITR values indicate how effectively resources are used. A reduction in the free time of the robots means that robots do the right tasks at the right time. ITR values of all robots for System-I and System-II are given in percent in Table 7 and Table 8, respectively.
The CQL algorithm provides low ITR values because it knows all existing situations of the environment and uses this information in the learning process. Thus, the task allocation process is balanced and the robots are run effectively. The highest ITR values are obtained by the DQL algorithm. In DQL, robots learn independently and are not concerned with their teammates' behavior. This causes behavior conflicts among team members, which is the major disadvantage of decentralized learning. For example, if two robots learn the same action for the same task and the task is assigned to one of them, the other goes into idle mode. The ITR values of the semi-CQL algorithm are better than those of DQL and worse than those of CQL, because semi-CQL knows the other robots' actions but learns in a decentralized manner.
The best ITR values are achieved by the subG-CQL algorithm. It forces some robots to drop some of their task types and to concentrate on fewer task types. Thus, the robots can be utilized effectively and their idle time decreases as desired. ITR values of the robots are shown graphically in Figure 5 and Figure 6 for System-I and System-II, respectively.
Figure 5. ITR of robots for System-I.
Figure 6. ITR of robots for System-II.

Learning space dimension
The learning space dimension, whose details are explained in Section 5.2, is directly related to the learning scheme and the system size. The learning space dimensions for both systems are given in Table 9. For all algorithms, the learning space dimension of System-II is higher than that of System-I because System-II is much larger. The smallest learning space dimension is obtained for DQL and semi-CQL, which use decentralized learning. The CQL algorithm has the highest learning space dimension, caused by the joint action space dimension. Especially for System-II, the learning space dimension is huge. This is a major obstacle to applying centralized learning, which is otherwise very successful in reaching the optimal solution.
The subG-CQL algorithm provides a reasonable learning space dimension compared to CQL, although both learn in a centralized manner. The subG-CQL algorithm divides the system into small-sized sub-groups, and each sub-group uses its own joint action and state spaces. Its learning space dimension is therefore equal to the sum of the sub-groups' learning space dimensions, whereas CQL considers the joint action and state space of the whole system.
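The difference between the two centralized schemas can be made concrete with a small sketch. It assumes, for illustration only, that the learning space dimension of a centralized learner is the number of states multiplied by the number of joint actions (the product of the individual action counts); the exact definition is the one given in Section 5.2. All function names and numbers below are hypothetical.

```python
from math import prod


def centralized_dimension(n_states: int, actions_per_robot: list[int]) -> int:
    """Size of one centralized learner: states times joint actions (assumed)."""
    return n_states * prod(actions_per_robot)


def subgrouped_dimension(subgroups: list[tuple[int, list[int]]]) -> int:
    """Sum of the sizes of sub-groups that each learn in a centralized manner."""
    return sum(centralized_dimension(n_states, actions)
               for n_states, actions in subgroups)


# Hypothetical system: 6 robots with 4 actions each and 100 system states.
whole_system = centralized_dimension(100, [4] * 6)

# The same robots split into two sub-groups of 3, each with 10 local states.
split_system = subgrouped_dimension([(10, [4] * 3), (10, [4] * 3)])
```

With these made-up numbers the whole-system learner needs 409600 entries, while the two sub-groups together need 1280, because the joint action space grows as a product within a group but only as a sum across groups.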
The results of the experimental studies can be summarized as follows in terms of the performance metrics discussed.
• As expected, the best CTR values, which are a sign of a successful MRTA process, are obtained by the CQL method for both high-priority and low-priority tasks. CTR values of the proposed algorithm, subG-CQL, are slightly lower than those of CQL.
• The lowest ITR values are achieved by the subG-CQL algorithm. Low ITR means that the robots and their abilities are used effectively and their wasted time is decreased.
• The semi-CQL and DQL methods, both of which learn in a decentralized manner, have the lowest learning space dimension. The subG-CQL algorithm yields a learning space dimension that is higher than that of DQL but considerably lower than that of the CQL method. The huge learning space dimension of CQL brings scalability problems and results in high computational cost and application difficulties.
When all metrics are evaluated together, the subG-CQL algorithm offers the best overall solution among the compared approaches. It has sufficiently high CTR values, good ITR values and an acceptable learning space dimension. These results also indicate that dividing the system into sub-groups has no adverse effect on the system.

Conclusions
Multi-robot system environments are dynamic and contain a high level of uncertainty due to the independent sensing, decision-making and acting facilities of the robots. The Q-learning method can provide optimal solutions for robotic applications, but it is problematic to use in multi-robot domains. When a decentralized learning structure is used, robots do not take the behaviors of their teammates into account and they ignore the effects of the dynamic environment characteristics. These are the main reasons why the optimal task allocation cannot be reached. Additionally, behavior conflicts occur because of the independent actions of team members. The biggest advantage of this structure is that its small learning space dimension brings a low computational load and easy application. When centralized Q-learning is used, the multi-robot system can be considered Markovian, because the overall state space of the environment and the joint action spaces of all robots are taken into account. This means that the major requirement for Q-learning to converge to the optimal solution is satisfied. However, it is quite difficult to use centralized Q-learning, especially for large systems. The computational cost is very high due to the huge joint action space dimension, which increases exponentially with the number of robots. Also, gathering full knowledge of the environment states and the agents' actions requires perfect communication among robots. In this study, an efficient solution has been developed for using Q-learning in multi-robot systems. The subG-CQL algorithm proposed here divides the whole system into sub-groups, each of which behaves as an independent small-sized system. This division is carried out in a way that does not degrade the task performance of the system. The combination of these sub-groups constitutes the whole system. Each sub-group uses Q-learning in a centralized manner.
Because each sub-group is small in terms of the number of robots and the number of tasks performed, the computational cost and communication requirements are reduced to a reasonable level. Thus, all the advantages of centralized learning are utilized. The experiments compare four different approaches: centralized Q-learning, semi-centralized Q-learning, decentralized Q-learning, and the proposed algorithm, subG-CQL. The proposed algorithm achieves fairly good system performance in terms of completed task ratio, idle time ratio of robots and learning space dimension. It successfully combines the advantages of the decentralized and centralized learning schemas, such as convergence to the optimal solution, low learning space dimension, low computational and communication cost, and easy application. The experimental results demonstrate the effectiveness of the proposed algorithm.
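The centralized learning step that each sub-group performs can be sketched with the standard tabular Q-learning update applied to a joint action. This is a minimal illustration rather than the implementation used in the experiments: the table layout, the function names, the learning rate alpha and the discount factor gamma are all assumptions.

```python
from collections import defaultdict
from itertools import product


def joint_actions(actions_per_robot):
    """Enumerate every combination of individual actions within one sub-group."""
    return list(product(*actions_per_robot))


def q_update(q, state, joint_action, reward, next_state,
             all_joint_actions, alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update to a (state, joint action) entry."""
    best_next = max(q[(next_state, a)] for a in all_joint_actions)
    td_target = reward + gamma * best_next
    q[(state, joint_action)] += alpha * (td_target - q[(state, joint_action)])
    return q[(state, joint_action)]


# Hypothetical sub-group of two robots, each choosing between two task types.
actions = joint_actions([["t1", "t2"], ["t1", "t2"]])
q_table = defaultdict(float)  # unseen entries start at 0.0
new_value = q_update(q_table, "s0", ("t1", "t2"), 1.0, "s1", actions)
```

Because the table is indexed by the joint action of the sub-group only, its size depends on the robots in that sub-group and not on the whole team.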

Author contribution statements
In the scope of this study, Hatice Hilal EZERCAN KAYIR contributed to all stages, including the formation of the idea, the literature review, the construction of the theoretical background, the design and application of the study, the supply of the materials used, the assessment of the obtained results, and the spell-checking and content review of the article.

Ethics committee approval and conflict of interest statement
There is no need to obtain permission from the ethics committee for the article prepared.
There is no conflict of interest with any person/institution in the article prepared.