A neural network layer is essentially an operator that transforms an input signal into another signal (possibly over another domain). When an operator is said to be equivariant to a group \(G\), any group element acting on the input signal leads to a predictable change in the output signal.
What is equivariance?
Given an operator \(\Phi:\mathcal{X}\rightarrow\mathcal{Y}\) and representations \(\rho^{\mathcal{X}},\rho^{\mathcal{Y}}\) of \(G\) acting on \(\mathcal{X}\) and \(\mathcal{Y}\) respectively,
\(\Phi\) is equivariant to \(G\) if: \(\forall g\in G, \rho^{\mathcal{Y}}(g)\circ\Phi=\Phi\circ\rho^{\mathcal{X}}(g)\)
What is a Haar measure on a group \(G\)?
A (left) Haar measure \(\text{d}g\) is a measure on \(G\) that is invariant under left group translation, i.e., \(\int_{G}f(\tilde{g}g)\,\text{d}g=\int_{G}f(g)\,\text{d}g\) for all \(\tilde{g}\in G\); a right Haar measure is defined analogously.
What is a unimodular group?
When the left and right Haar measures coincide, the underlying group \(G\) is said to be a unimodular group.
How to prove (e.g., 2.10) that the correlation operator is equivariant to the translation group?
The correlation operator is defined as \((k \star f)(x) = (\mathcal{T}_{x}k, f)_{\mathbb{L}_{2}(\Re^d)}\).
As the translation group has a left-regular representation on the space \(\mathbb{L}_{2}(\Re^d)\), we have \(\forall g\in\Re^d, (\rho(g)\circ f)(\tilde{x}) = f(g^{-1}\odot \tilde{x}) = f(\tilde{x} - g)\). Thus, correlating \(k\) with the transformed signal gives \(\int_{\tilde{x}} k(\tilde{x}-x) f(\tilde{x}-g)\text{d}\tilde{x}\). Similarly, \((\rho(g)\circ (k\star f))(x) = (k\star f)(x-g)\), which expands to \(\int_{\tilde{x}} k(\tilde{x}-(x-g)) f(\tilde{x})\text{d}\tilde{x}\).
Let \(x' = \tilde{x}-g\) and apply the substitution. We get \(\int_{\tilde{x}} k(\tilde{x}-x) f(\tilde{x}-g)\text{d}\tilde{x} = \int_{x'} k(x' - x + g) f(x')\text{d}x'\), which is exactly the result of applying the group representation to the output signal.
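As a sanity check, here is a minimal numeric verification of this equivariance, assuming discrete circular signals on \(\mathbb{Z}_n\) stand in for \(\mathbb{L}_{2}(\Re^d)\) and `np.roll` implements the regular representation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = rng.normal(size=n)   # kernel
f = rng.normal(size=n)   # signal
g = 3                    # a translation group element

def corr(k, f):
    # circular cross-correlation: (k ⋆ f)(x) = sum_t k(t - x) f(t)
    return np.array([np.dot(np.roll(k, x), f) for x in range(len(f))])

lhs = corr(k, np.roll(f, g))   # transform the input, then correlate
rhs = np.roll(corr(k, f), g)   # correlate, then transform the output
assert np.allclose(lhs, rhs)   # equivariance holds
```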
Any useful result from Haar measure?
(e.g., Lemma 2.3) Given \(k, f \in \mathbb{L}_{2}(G)\), \(\mathcal{L}_g\) the left-regular representation of \(g\in G\) on \(\mathbb{L}_{2}(G)\), and Haar measure \(\text{d}g\). Then we have \((\mathcal{L}_{g}k, f)_{\mathbb{L}_{2}(G)} = (k, \mathcal{L}_{g^{-1}}f)_{\mathbb{L}_{2}(G)}\)
As the textbook shows, \(LHS = \int_{G} [\mathcal{L}_{g}k](\tilde{g}) f(\tilde{g})\text{d}\tilde{g} = \int_{G} k(g^{-1}\tilde{g}) f(\tilde{g})\text{d}\tilde{g}\).
We can make the substitution \(\tilde{g}=gg'\), i.e., \(g'=g^{-1}\tilde{g}\). Then the equation becomes \(\int_{g^{-1}G} k(g') f(gg')\,\text{d}(gg')\), which, because the measure is a (left-invariant) Haar measure and \(g^{-1}G=G\), can be written as \(\int_{G} k(g') f(gg')\text{d}g' = \int_{G} k(g') [\mathcal{L}_{g^{-1}}f](g')\text{d}g' = RHS\).
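A quick numeric check of the lemma, assuming the translation group on \(\mathbb{Z}_n\) with the counting measure playing the role of the Haar measure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, g = 8, 3
k = rng.normal(size=n)
f = rng.normal(size=n)

# [L_g k](t) = k(t - g) is np.roll(k, g); [L_{g^{-1}} f](t) = f(t + g)
lhs = np.dot(np.roll(k, g), f)    # (L_g k, f)
rhs = np.dot(k, np.roll(f, -g))   # (k, L_{g^{-1}} f)
assert np.isclose(lhs, rhs)
```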
How to interpret the notations of Haar measure’s invariance?
Suppose, in the above example, \(G\) is the translation group on \(\Re\), \(g=1\), and the integration interval is \([a, b]\). Then the variable substitution \(\tilde{g}=gg' = 1 + g'\) lets \(1+g'\) range over \([a, b]\), which, in turn, lets \(g'\) range over \(g^{-1}[a, b] = [a, b] - 1 = [a-1, b-1]\). Thus, a Haar measure is a “uniform measure”, that is to say, intervals related by a translation have the same measure (length), i.e., \(\|[a, b]\|=\|[a-1, b-1]\|\).
Any concrete example of the above equation?
(e.g., EX 2.6) Because \(G=SE(d)\), we have \(\tilde{x}=g\odot x' = R_{g}x' + x_{g}\), and \(\text{d}\tilde{x}=\text{d}(R_{g} x' + x_g) = \text{d}x'\) is a Haar measure, since \(|\det R_{g}| = 1\) and the translation \(x_g\) does not change volumes.
It is often helpful to study a homogeneous space of a group.
We can start with a group and a subgroup of it to build the corresponding quotient space, which is a homogeneous space of the original group. Conversely, we can start with a homogeneous space and use the stabilizer of some element of this space as the subgroup, whose quotient space can be identified with (i.e., there exists an isomorphism to) the original space.
When will we say a group action is transitive?
A group action \(\odot:G\times\mathcal{X}\rightarrow\mathcal{X}\) is transitive if \(\forall x, \tilde{x}\in\mathcal{X},\exists g\in G\), s.t. \(\tilde{x}=g\odot x\).
When will a space be called a homogeneous space for the group acting on it?
When the group action on it is transitive.
What is a semidirect-product?
Basically, it is a group.
It is constructed from two groups \(N\) and \(H\) with a group action \(\odot: H\times N\rightarrow N\). We denote the semidirect-product by \(N\rtimes H\).
The group product and inverse are defined as:
\[(n,h)\cdot(\tilde{n},\tilde{h}) = (n\cdot (h\odot\tilde{n}), h\cdot\tilde{h})\] \[(n,h)^{-1}=(h^{-1}\odot n^{-1}, h^{-1})\]Note that, for simplicity, we use the same symbol \(\cdot\) for these two groups’ respective group product.
What is a coset?
For a group \(G\) and a subgroup \(H\) of it, \(gH=\{g\cdot h \mid h\in H\}\) for any group element \(g\in G\) is a (left) coset.
What is quotient space?
Given a group \(G\) and its subgroup \(H\), the quotient space \(G/H\) denotes the collection of distinct cosets \(\{gH \mid g\in G\}\).
Thus, elements in \(G/H\) are equivalence classes, where any \(g\) and \(\tilde{g}\) are in the same class iff \(\exists h\in H\) s.t. \(g=\tilde{g}\cdot h\).
What is a stabilizer?
Suppose there is a group action \(\odot: G\times \mathcal{X}\rightarrow \mathcal{X}\); then the stabilizer (a subgroup) of \(G\) w.r.t. \(x_0 \in \mathcal{X}\) is defined as: \(\text{Stab}_{G}(x_0)=\{g\in G \mid g\odot x_0 = x_0 \}\)
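A tiny brute-force illustration, assuming \(G\) is the group of rotations by multiples of \(90^\circ\) acting on the plane:

```python
import numpy as np

def rot(k):  # rotation by k * 90 degrees
    th = k * np.pi / 2
    return np.array([[np.cos(th), -np.sin(th)],
                     [np.sin(th),  np.cos(th)]])

def stabilizer(x0):
    # enumerate the group elements that fix x0
    return [k for k in range(4) if np.allclose(rot(k) @ x0, x0)]

print(stabilizer(np.array([0.0, 0.0])))  # [0, 1, 2, 3]: the whole group
print(stabilizer(np.array([1.0, 0.0])))  # [0]: only the identity
```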
What is an affine group?
Groups constructed as \(\Re^{d}\rtimes H\) for a certain \(H\subseteq\text{GL}(\Re^{d})\), where \(\text{GL}(\Re^d)\), the general linear group, consists of the invertible matrices (linear transformations) acting on \(\Re^d\); \(H\) is commonly a subgroup of \(\text{GL}(\Re^d)\).
Any example of a semidirect product?
Consider \(SE(d)=(\Re^{d},+)\rtimes SO(d)\). Then the group product and inverse element can be expressed as:
\[(x, R)\cdot(\tilde{x},\tilde{R})=(x+R\tilde{x}, R\tilde{R})\] \[(x, R)^{-1}=(-R^{-1}x, R^{-1})\]where \(x\in\Re^{d}\) and \(R\in SO(d)\) is a rotation matrix.
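A minimal numeric sanity check of these formulas, assuming SE(2) elements are stored as pairs \((x, R)\) with \(x\) a 2-vector and \(R\) a \(2\times 2\) rotation matrix:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def product(g1, g2):
    # (x, R) . (x~, R~) = (x + R x~, R R~)
    (x1, R1), (x2, R2) = g1, g2
    return (x1 + R1 @ x2, R1 @ R2)

def inverse(g):
    # (x, R)^{-1} = (-R^{-1} x, R^{-1}); R^{-1} = R^T for rotations
    x, R = g
    return (-R.T @ x, R.T)

g = (np.array([1.0, 2.0]), rot(0.3))
h = (np.array([-0.5, 0.7]), rot(1.1))

# g . g^{-1} should be the identity (0, I)
x_e, R_e = product(g, inverse(g))
assert np.allclose(x_e, 0) and np.allclose(R_e, np.eye(2))

# associativity: (g.h).g == g.(h.g)
lhs = product(product(g, h), g)
rhs = product(g, product(h, g))
assert np.allclose(lhs[0], rhs[0]) and np.allclose(lhs[1], rhs[1])
```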
Are cosets always groups?
No, a coset need not be a group; e.g., a coset \(gH\) with \(g\notin H\) does not contain the identity element.
Any example of quotient space?
Suppose \(G=\{0, 1, \ldots, 7\}\), and the group product is defined as \(g\cdot\tilde{g}=(g+\tilde{g})\text{ mod }8\).
Suppose \(H=\{0, 4\}\), then the quotient space \(G / H\) consists of \(\{0, 4\}, \{1, 5\}, \{2, 6\}, \{3, 7\}\).
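This can be enumerated directly in a few lines (assuming the \(\mathbb{Z}_8\) example above):

```python
G = range(8)
H = {0, 4}
# each coset gH = {(g + h) mod 8 | h in H}; the set of distinct cosets is G/H
cosets = {frozenset((g + h) % 8 for h in H) for g in G}
print(sorted(sorted(c) for c in cosets))
# [[0, 4], [1, 5], [2, 6], [3, 7]]
```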
What’s the relationship between \(G\) and its quotient space \(G/H\)?
Define the group action \(i \odot gH\) to be \(\{i\cdot j \mid j\in gH\}\).
Then, for any \(gH\) and \(\tilde{g}H\), \(\exists i\in G\) s.t. \(i\odot gH = \tilde{g}H\): to let \(i\cdot g\cdot h = \tilde{g}\cdot h\), we can just take \(i=\tilde{g} \cdot g^{-1}\).
Thus, \(G/H\) is a homogeneous space of \(G\).
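Continuing the \(\mathbb{Z}_8\) example, the transitivity witness \(i=\tilde{g}\cdot g^{-1}\) (here \((\tilde{g}-g)\bmod 8\)) can be checked exhaustively:

```python
def coset(g):
    return frozenset((g + h) % 8 for h in {0, 4})

def act(i, c):
    # i ⊙ gH = {i · j | j in gH}
    return frozenset((i + j) % 8 for j in c)

for g in range(8):
    for g_tilde in range(8):
        i = (g_tilde - g) % 8   # i = g~ · g^{-1} in additive notation
        assert act(i, coset(g)) == coset(g_tilde)
```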
How to interpret affine space from the perspective of quotient space?
Taking SE(2)=\(\Re^2 \rtimes S^1\) as an example: the group product, according to the definition of the semidirect product, is \(g_1 \cdot g_2 =(x_1, \theta_1) \cdot (x_2, \theta_2) = (x_1 + R_{\theta_1}x_2, \theta_1 + \theta_2)\). Obviously, \(H=\{(0, \theta) \mid \theta \in S^1 \}\) is a subgroup, where \((0, 0)\) is the group product identity and closure is preserved. Then each equivalence class is of the form \(\{ (x, \theta) \mid \theta \in S^1 \}\), so the cosets are indexed by \(x\), and \(\text{SE}(2)/\text{SO}(2)\) can be identified with \(\Re^2\).
Why can the representation of a semidirect product be decomposed into the function composition of their respective representations?
Tailin Wu identified key points as follows:
He also suggests reading the following papers, with helpful comments:
GraphCast: Learning skillful medium-range global weather forecasting
Proposed by DeepMind: a graph neural network used as a surrogate model for medium-range (7-day) weather forecasting. The model surpasses the accuracy of traditional forecasting methods developed over decades. Its multi-scale graph neural network architecture and training method are worth borrowing from.
Fourier Neural Operator for Parametric Partial Differential Equations
This paper proposes a neural-operator architecture based on the Fourier transform, which realizes direct mappings between function spaces. It achieves very good accuracy in simulating partial differential equations and also enables super-resolution.
Learning Mesh-Based Simulation with Graph Networks
Proposes MeshGraphNets, which perform mesh-based simulation very well and can be applied to physical simulation fields such as fluid dynamics and computer graphics.
Learning Controllable Adaptive Simulation for Multi-resolution Physics
Targeting the multi-resolution problem in many scientific simulations, this paper proposes a new method: one MeshGraphNet learns the evolution of the system while another MeshGraphNet learns spatially local remeshing, achieving a reasonable trade-off between accuracy and computational cost.
Highly accurate protein structure prediction with AlphaFold
Proposes the famous AlphaFold 2.0, whose accuracy in predicting the three-dimensional conformation of proteins greatly surpasses other methods.
Deep Potential Molecular Dynamics: A Scalable Model with the Accuracy of Quantum Mechanics
Proposes the Deep Potential Molecular Dynamics method for molecular simulation. The model incorporates all natural symmetries of the system, and its accuracy reaches quantum-mechanical precision.
E(n) Equivariant Graph Neural Networks
Proposes equivariant graph neural networks, building equivariance to spatial translations and rotations into the design of graph neural networks and achieving superior performance on molecular property prediction. The idea of building symmetry into neural network design is worth learning from.
It is equivariant to \(E(n)\) and permutations. This is achieved by forward propagating in the following way:
\[m_{ij} = \phi_{e}(h_{i}^{l},h_{j}^{l},\|x_{i}^{l}-x_{j}^{l}\|^2,a_{ij})\\ x_{i}^{l+1}=x_{i}^{l}+C\sum_{j\neq i}(x_{i}^{l}-x_{j}^{l})\phi_{x}(m_{ij})\\ m_{i}=\sum_{j\neq i}m_{ij}\\ h_{i}^{l+1}=\phi_{h}(h_{i}^{l}, m_i)\]The proof sketch is intuitive. At first, keep in mind that an EGNN layer transforms a tuple \((h_{i}^{l},x_{i}^{l})\) into a tuple \((h_{i}^{l+1},x_{i}^{l+1})\). Suppose \(h_{i}^{l}\) is already invariant to \(E(n)\), which holds since the group elements act on \(x_{i}^{l}\) rather than \(h_{i}^{l}\). Then \(m_{ij}\) must be invariant to \(E(n)\) because
\[(Qx_{i}^{l}+g) - (Qx_{j}^{l}+g) = Q(x_{i}^{l}-x_{j}^{l})\]and \(Q\) is an orthogonal matrix expressing the rotation (or reflection), so the distance is preserved. Finally, \(m_{i}\) is invariant to \(E(n)\) as all \(m_{ij}\) are. Thus, \(h_{i}^{l+1}\) is invariant to \(E(n)\). Similarly, \(x_{i}^{l+1}\) is equivariant to \(E(n)\).
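Below is a minimal numpy sketch of one EGNN layer that checks these claims numerically. Assumptions: \(\phi_e,\phi_x,\phi_h\) are stand-in random linear maps with tanh (not the paper's MLPs), \(C=1/(n-1)\), the graph is complete without self-loops, and there are no edge attributes \(a_{ij}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dh = 4, 3                      # nodes, feature dimension
W_e = rng.normal(size=(2 * dh + 1, dh))
W_x = rng.normal(size=(dh, 1))
W_h = rng.normal(size=(2 * dh, dh))

def layer(h, x):
    h_new, x_new = h.copy(), x.copy()
    C = 1.0 / (n - 1)
    for i in range(n):
        m_i = np.zeros(dh)
        for j in range(n):
            if j == i:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)                      # invariant input
            m_ij = np.tanh(np.concatenate([h[i], h[j], [d2]]) @ W_e)
            x_new[i] = x_new[i] + C * (x[i] - x[j]) * (m_ij @ W_x)
            m_i += m_ij
        h_new[i] = np.tanh(np.concatenate([h[i], m_i]) @ W_h)
    return h_new, x_new

h = rng.normal(size=(n, dh))
x = rng.normal(size=(n, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
g = rng.normal(size=3)                        # random translation

h1, x1 = layer(h, x)
h2, x2 = layer(h, x @ Q.T + g)                # act on the input first
assert np.allclose(h1, h2)                    # h is E(n)-invariant
assert np.allclose(x1 @ Q.T + g, x2)          # x is E(n)-equivariant
```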
Inferring edges. When the adjacency has not been given, considering a complete graph is OK but not scalable. One can instead learn a function mapping \(m_{ij}\) to \(e_{ij} \in [0, 1]\) that weights \(m_{ij}\) for all \(j\) in the aggregation.
Molecular dynamics simulation. To extend EGNN with momentum, namely, transforming \((h_{i}^{l},x_{i}^{l}, v_{i}^{l})\) to \((h_{i}^{l+1},x_{i}^{l+1}, v_{i}^{l+1})\). To this end, EGNN can make forward propagation in the following way:
\[v_{i}^{l+1}=\phi_{v}(h_{i}^{l})v_{i}^{\text{init}}+C\sum_{j\neq i}(x_{i}^{l}-x_{j}^{l})\phi_{x}(m_{ij})\\ x_{i}^{l+1}=x_{i}^{l}+v_{i}^{l+1}\]Note that \(v\) stands for velocity, and thus when we say \(x_{i}^{l+1}\), in this case, is still equivariant to \(E(n)\), we mean that the previous \(x\) becomes \(Qx+g\) while the initial \(v\) becomes \(Qv\) rather than \(Qv+g\).
Graph autoencoder. There is a symmetry problem when there are no node attributes or only identical ones, i.e., the same \(h_{i}^{0}\) for all nodes. In this case, all nodes receive the same embeddings, and thus the same predicted probability of the existence of any edge \(e_{ij}\). A convention to resolve this issue is to add Gaussian noise to \(h_{i}^{0}\).
Other related concepts and questions: what spherical harmonics are; what a radial direction and a radial field are; how to interpret the extension with momentum; what an isometry invariant is.
To be continued…
]]>What’s vector space?
A vector space over a field \(F\) is a set \(V\) with two binary operators (vector addition and scalar multiplication) that satisfy the usual axioms: associativity and commutativity of addition, an additive identity and inverses, and compatibility, identity, and distributivity of scalar multiplication.
What’s group?
\((G, \cdot)\), where \(G\) is a set, and \(\cdot\) is the group product (i.e., a binary operator), so that
What’s Lie group? (not that formal)
It is continuous group that is also a differentiable manifold, where continuous group means \(G\) is infinite, and its group operator is continuous.
What’s subgroup?
When we say \((H,\cdot)\) is a subgroup of \((G,\cdot)\), \(H\subset G\) should preserve the closure property of \(\cdot\).
What’s group homomorphism?
Given two gruops \((G,\ast)\) and \((H,\cdot)\), a group homomorphism from the former to the latter is a function \(f\) that satisfies:
\[\forall g,\tilde{g}\in G,(h=f(g)\text{ and }\tilde{h}=f(\tilde{g}))\rightarrow h\cdot\tilde{h} = f(g\ast\tilde{g})\]What’s group action?
Given a group \((G,\cdot)\), a group action on a space \(X\) is a binary operator \(\odot\) from \(G\times X\) to \(X\), so that:
\[\forall g,\tilde{g}\in G,x\in X, g\odot(\tilde{g}\odot x) = (g\cdot\tilde{g})\odot x\]What’s representation?
Given a group \((G,\cdot)\), a representation parameterized by \(g\in G\) is a linear and invertible function \(\rho(g)\) mapping from a vector space \(V\) to itself, so that:
\[\forall g,\tilde{g}\in G,v\in V, \rho(g)\rho(\tilde{g})v = \rho(g\cdot\tilde{g})v\]What’s matrix representation?
When the dimension of \(V\), denoted by \(d\), is finite, it is equivalent to consider \(\Re^{d}\). Then any linear transformation can be expressed by a \(d\times d\) matrix. So a matrix representation \(D(g)\) is a \(d\times d\) matrix that respects the properties a representation should have and acts on \(v\in \Re^{d}\) by matrix-vector multiplication. It is often written \(D(g)\in\text{GL}(d,\Re)\), where “GL” is short for general linear group.
What’s left-regular representation?
Suppose the vector space \(V\) consists of functions in \(\mathbb{L}_2(X)\), and a group \(G\) has group action on \(X\) denoted by \(\odot\). Then a left-regular representation \(\mathcal{L}_{g}\) of \(G\) acting on \(\mathbb{L}_2(X)\) is representation that satisfies:
\[\forall g\in G,\forall f\in V, \forall x\in X,[\mathcal{L}_{g}f](x)=f(g^{-1}\odot x)\]Why do we need group product?
Recall that, in Chapter 1, we interpret cross-correlation as the inner product between the kernel (transformed by a group element) and the signal. Taking input at different positions in the cross-correlation means transforming the kernel by different group elements. Thus, the group product can be interpreted as the composition of two transformations, and the inverse of \(g\) means the transformation cancelling out \(g\)'s effect.
Examples include translation group \(G=(\Re^d,+)\) and rotation group \(\text{SO}(2)=([0,2\pi),+_{\text{mod }2\pi})\), which is often parameterized as:
\(R_{\theta}=\left[\begin{array}{cc} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \\ \end{array}\right],\) and the group product corresponds to matrix multiplication.
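A two-line check that matrix multiplication indeed realizes the SO(2) group product (angles add mod \(2\pi\)):

```python
import numpy as np

def R(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

a, b = 2.0, 5.5
assert np.allclose(R(a) @ R(b), R((a + b) % (2 * np.pi)))
```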
So why do we need group structure?
Obviously, the translation group with its group product “+” and vanilla scalar-vector multiplication forms a vector space. As for SO(2), with its group product “+” and scalar multiplication (followed by mod \(2\pi\)), it also forms a vector space. At least, their group products are our very familiar arithmetic operations (matrix multiplication between two 2D rotation matrices is commutative). However, for SE(\(d\)), the group product is:
\[(x,R) \cdot (\tilde{x},\tilde{R})=(x+R\tilde{x},R\tilde{R}),\]which is not commutative, and thus SE(\(d\)) cannot be made a vector space by regarding the group product as vector addition and supplementing a definition of scalar-vector multiplication.
How to interpret “group action is a group homomorphism”?
It is a mapping from \(G\) to \(\{f_g \mid g\in G\}\), where the group product of \(\{f_g \mid g\in G\}\) is function composition, namely, \(f_g \cdot f_h = f_g\circ f_{h}\). Then we need to prove \(f_{g\cdot h}=f_g \cdot f_h\) (note that the \(\cdot\) on the LHS and RHS are the products of the respective groups), which holds by the definition of a group action.
What’s the essential property that makes a group representation left-regular?
Convolution — Cross-correlation — Template matching, i.e., inner product at different “positions”
“at different positions” — “between kernel transformed by different group elements and the input signal”
CNN (positions mean different translations) — R-CNN (positions mean different roto-translations)
What’s convolution?
For \(k,f\in\mathbb{L}_{2}(X)\),
\[(k\ast f)(x)=\int_{X} k(x-\tilde{x})f(\tilde{x})d\tilde{x}\]Two functions (a kernel and a signal) are transformed into another function (signal).
What’s cross-correlation?
For \(k,f\in\mathbb{L}_{2}(X)\),
\[(k\star f)(x)=\int_{X} k(\tilde{x}-x)f(\tilde{x})d\tilde{x}\]For multi-channel signals, we just sum up the results of all channels in calculating convolution and cross-correlation.
What are the translation and roto-translation operators?
Let \(k\in\mathbb{L}_{2}(\Re^d)\), then, for each \(x\in\Re^d\),
\[[\mathcal{T}_{x}k](\tilde{x})=k(\tilde{x}-x)\]We say \(\mathcal{T}\) parameterized by \(x\in\Re^d\) is a translation operator.
Let \(g=(x,\theta)\) where \(x\) is a translation vector, and \(\theta\) is the rotation angle,
\[[\mathcal{L}_{g}k](\tilde{x})=k(R_{\theta}^{-1}(\tilde{x}-x))\]We say \(\mathcal{L}\) parameterized by \(g\) is a roto-translation operator. In particular, \(R_{\theta}\) here is a matrix executing the rotation action, and its inverse means rotating by the opposite angle (see later chapters for more details).
Other necessary definitions in this chapter include the inner product and norm of \(\mathbb{L}_{2}(X)\).
Are convolution and cross-correlation the same thing?
Yes, they are related via kernel reflection: letting \(k(x)=k'(-x),\forall x\in X\), we have \(k\ast f=k'\star f\).
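A small numeric confirmation, assuming discrete circular convolution/correlation on \(\mathbb{Z}_n\) stands in for the continuous definitions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = rng.normal(size=n)
f = rng.normal(size=n)

conv = np.array([sum(k[(x - t) % n] * f[t] for t in range(n)) for x in range(n)])
k_reflected = k[(-np.arange(n)) % n]   # k'(x) = k(-x)
corr = np.array([sum(k_reflected[(t - x) % n] * f[t] for t in range(n)) for x in range(n)])
assert np.allclose(conv, corr)          # k * f == k' ⋆ f
```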
What is cross-correlation doing?
It makes “template matching”.
Recall that convolution is the “same” as cross-correlation, so CNN is making template matching.
What’s the source of CNN’s power?
What’s the motivation of G-CNN?
It is still making template matching with inner product but the kernel is roto-translation lifted (not just translated), that is to say, making group correlation (here roto-translation lifting correlation):
\[(k\star_{\text{SE(2)}}f)(x,\theta)=(\mathcal{L}_{g}k,f)_{\mathbb{L}_{2}(\Re^2)}=\int_{\Re^2}k(R_{\theta}^{-1}(\tilde{x}-x))f(\tilde{x})d\tilde{x}\]where the two functions (kernel \(k\) and signal \(f\)) are transformed into a higher-dimensional function (a signal, or say feature map) with input \((x,\theta)\). Here “SE” is short for the special Euclidean motion group.
In this way, those two points regarded as CNN’s power are further generalized, namely, from translation to roto-translation.
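For intuition, here is a minimal sketch of the lifting correlation, assuming the rotation part is discretized to multiples of \(90^\circ\) (so `np.rot90` implements the kernel rotation exactly, up to the rotation convention) and `'same'` padding stands in for the integral over \(\Re^2\):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
k = rng.normal(size=(3, 3))   # kernel
f = rng.normal(size=(8, 8))   # input image

# one 2D correlation per rotated kernel: output is indexed by (theta, x, y)
feature_map = np.stack(
    [correlate2d(f, np.rot90(k, t), mode='same') for t in range(4)]
)
print(feature_map.shape)      # (4, 8, 8)
```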
Q-learning is fixed-point iteration.
TBD
We must understand Python's `threading.Condition` before diving into the implementation of hpbandster. At first, `threading.Condition` has methods `acquire()` and `release()` and obeys the context management protocol:

> All of the objects provided by this module (i.e., threading) that have acquire() and release() methods can be used as context managers for a with statement. The acquire() method will be called when the block is entered, and release() will be called when the block is exited.

Please see the Python documentation for more details.
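Here is a minimal producer/consumer sketch of `threading.Condition` (with hypothetical names), mirroring how the master waits and `job_callback()` notifies:

```python
import threading
import time

cond = threading.Condition()
done = []

def worker():
    time.sleep(0.1)
    with cond:            # acquire() on enter, release() on exit
        done.append("job-0")
        cond.notify()     # wake up the waiting master

threading.Thread(target=worker).start()
with cond:
    while not done:       # guard against spurious wakeups
        cond.wait()       # releases the lock while blocked
print(done)               # ['job-0']
```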
At first, the base class for all these optimizers is the `Master` class (see hpbandster/core/master.py), which utilizes a `Dispatcher` object for

> assigning tasks to free workers, report results back to the master and communicate to the nameserver.

Let's see how it achieves this from the construction:

```python
self.dispatcher = Dispatcher(
    self.job_callback,
    queue_callback=self.adjust_queue_size,
    run_id=run_id,
    ping_interval=ping_interval,
    nameserver=nameserver,
    nameserver_port=nameserver_port,
    host=host,
)
```
The first argument is `Master`'s method `job_callback()`, which takes in a `Job` object (see hpbandster/core/dispatcher.py) once that job is finished and does some “bookkeeping”, e.g., `self.num_running_jobs -= 1`, `self.iterations[job.id[0]].register_result(job)`, `self.config_generator.new_result(job)`, and `self.thread_cond.notify()`.
The argument `queue_callback` is specified as `Master`'s method `adjust_queue_size()`, which

> gets called with the number of workers in the pool on every update-cycle

It accordingly updates `Master`'s `job_queue_sizes` attribute and then notifies all the threads waiting on the condition (i.e., `self.thread_cond.notify_all()`).
The `run()` method of the `Dispatcher` object is called immediately after its instantiation, which triggers two threads: one runs `discover_workers()` and the other runs `job_runner()`.
The `run()` method of `Master` is the entry point of the whole optimization procedure:
```python
def run(self, n_iterations=1, min_n_workers=1, iteration_kwargs={},):
    """
    run n_iterations of SuccessiveHalving

    Parameters
    ----------
    n_iterations: int
        number of iterations to be performed in this run
    min_n_workers: int
        minimum number of workers before starting the run
    """
    self.wait_for_workers(min_n_workers)

    iteration_kwargs.update({'result_logger': self.result_logger})

    if self.time_ref is None:
        self.time_ref = time.time()
        self.config['time_ref'] = self.time_ref
        self.logger.info('HBMASTER: starting run at %s' % (str(self.time_ref)))

    self.thread_cond.acquire()
    while True:
        self._queue_wait()

        next_run = None
        # find a new run to schedule
        for i in self.active_iterations():
            next_run = self.iterations[i].get_next_run()
            if not next_run is None:
                break

        if not next_run is None:
            self.logger.debug('HBMASTER: schedule new run for iteration %i' % i)
            self._submit_job(*next_run)
            continue
        else:
            if n_iterations > 0:  # we might be able to start the next iteration
                self.iterations.append(self.get_next_iteration(len(self.iterations), iteration_kwargs))
                n_iterations -= 1
                continue

        # at this point there is no immediate run that can be scheduled,
        # so wait for some job to finish if there are active iterations
        if self.active_iterations():
            self.thread_cond.wait()
        else:
            break

    self.thread_cond.release()

    for i in self.warmstart_iteration:
        i.fix_timestamps(self.time_ref)

    ws_data = [i.data for i in self.warmstart_iteration]

    return Result([copy.deepcopy(i.data) for i in self.iterations] + ws_data, self.config)
```
`wait_for_workers()` blocks the execution until there are enough free workers, where the `self.thread_cond.wait(1)` will be notified by the `self.thread_cond.notify()` in `job_callback()`.
`self.result_logger` is used for live logging (more details can be found here).
`self.time_ref` is set to `None` in the constructor of `Master` and thus becomes the current moment here.
`self.thread_cond` is an object of Python's `threading.Condition`, which is used for coordinating the threads.
The first time we enter the `while` loop, the `_queue_wait()` method will not block the execution, as `job_queue_sizes` has been changed from `(-1, 0)` to `(0, 1)` by `adjust_queue_size()` called by `discover_workers()`, where there is one worker in this example.
`self.iterations` is a list intended to hold `n_iterations` iterations (each a `SuccessiveHalving` object).
The first time we enter the `while` loop, `active_iterations()` cannot find any active iteration. Thus, `next_run` is `None`, and a `SuccessiveHalving` object returned by the `get_next_iteration()` method is appended to `self.iterations`.
By `continue`, we enter the `while` loop again and come to the first `for` loop.
The base class of `SuccessiveHalving` is the `BaseIteration` class (see hpbandster/core/base_iteration.py), which has the attribute `is_finished` (`False` by construction). `active_iterations()` returns `[0]` (i.e., the first iteration is active), and the line `next_run = self.iterations[i].get_next_run()` is executed.
By this call, the `SuccessiveHalving` object returns a tuple consisting of:

- `config_id`: a tuple where the first element is the iteration index of the `Master`, the second element is the stage index of the `SuccessiveHalving` object (starting from zero), and the third element is the config index among the considered configs at this stage.

The returned value is then fed into the method `self._submit_job(*next_run)` so that the dispatcher can submit the job to the nameserver, where `num_running_jobs` is increased by one.
By `continue`, we enter the `while` loop again. The `_queue_wait()` method will block the execution until the submitted job has finished, that is, until `num_running_jobs` becomes zero again via the update made by `job_callback()`.
The procedure goes on in this way.
What do you expect?
```python
import copy

import torch


class LogisticRegression(torch.nn.Module):
    def __init__(self, in_channels, class_num):
        super(LogisticRegression, self).__init__()
        self.fc = torch.nn.Linear(in_channels, class_num)

    def forward(self, x):
        return self.fc(x)


m = LogisticRegression(2, 2)
a = m.state_dict()
print(m.state_dict())
print(a)
print("\n\n")

for v in m.parameters():
    #v[0] = 123.0
    #v[0].data = torch.Tensor([123.0])
    v.data -= 0.5 * v.data
print("state_dict() is shallow copy:")
print(m.state_dict())
print(a)
print("\n\n")

# whether to comment this line matters!!!!!!
a = copy.deepcopy(a)
m.load_state_dict(a)
for k in a:
    #a[k][1] = 456.0
    #a[k][1].data = torch.Tensor([456.0])
    a[k].data -= 0.5 * a[k].data
print("load_state_dict() is ? copy:")
print(m.state_dict())
print(a)
print("\n\n")

for param in m.parameters():
    #param[0] = 999.0
    #param[0].data = torch.Tensor([999.0])
    param.data -= 0.5 * param.data
print("load_state_dict() is ? copy:")
print(m.state_dict())
print(a)
```
The output is as follows:
```
OrderedDict([('fc.weight', tensor([[-0.0472, -0.4500],
        [-0.3051, -0.3033]])), ('fc.bias', tensor([-0.1162, -0.4246]))])
OrderedDict([('fc.weight', tensor([[-0.0472, -0.4500],
        [-0.3051, -0.3033]])), ('fc.bias', tensor([-0.1162, -0.4246]))])

state_dict() is shallow copy:
OrderedDict([('fc.weight', tensor([[-0.0236, -0.2250],
        [-0.1525, -0.1517]])), ('fc.bias', tensor([-0.0581, -0.2123]))])
OrderedDict([('fc.weight', tensor([[-0.0236, -0.2250],
        [-0.1525, -0.1517]])), ('fc.bias', tensor([-0.0581, -0.2123]))])

load_state_dict() is ? copy:
OrderedDict([('fc.weight', tensor([[-0.0236, -0.2250],
        [-0.1525, -0.1517]])), ('fc.bias', tensor([-0.0581, -0.2123]))])
OrderedDict([('fc.weight', tensor([[-0.0118, -0.1125],
        [-0.0763, -0.0758]])), ('fc.bias', tensor([-0.0291, -0.1062]))])

load_state_dict() is ? copy:
OrderedDict([('fc.weight', tensor([[-0.0118, -0.1125],
        [-0.0763, -0.0758]])), ('fc.bias', tensor([-0.0291, -0.1062]))])
OrderedDict([('fc.weight', tensor([[-0.0118, -0.1125],
        [-0.0763, -0.0758]])), ('fc.bias', tensor([-0.0291, -0.1062]))])
```
or as follows with `a = copy.deepcopy(a)` commented out:
```
OrderedDict([('fc.weight', tensor([[ 0.1331,  0.0882],
        [ 0.2721, -0.1324]])), ('fc.bias', tensor([-0.5216,  0.4661]))])
OrderedDict([('fc.weight', tensor([[ 0.1331,  0.0882],
        [ 0.2721, -0.1324]])), ('fc.bias', tensor([-0.5216,  0.4661]))])

state_dict() is shallow copy:
OrderedDict([('fc.weight', tensor([[ 0.0665,  0.0441],
        [ 0.1360, -0.0662]])), ('fc.bias', tensor([-0.2608,  0.2330]))])
OrderedDict([('fc.weight', tensor([[ 0.0665,  0.0441],
        [ 0.1360, -0.0662]])), ('fc.bias', tensor([-0.2608,  0.2330]))])

load_state_dict() is ? copy:
OrderedDict([('fc.weight', tensor([[ 0.0333,  0.0220],
        [ 0.0680, -0.0331]])), ('fc.bias', tensor([-0.1304,  0.1165]))])
OrderedDict([('fc.weight', tensor([[ 0.0333,  0.0220],
        [ 0.0680, -0.0331]])), ('fc.bias', tensor([-0.1304,  0.1165]))])

load_state_dict() is ? copy:
OrderedDict([('fc.weight', tensor([[ 0.0166,  0.0110],
        [ 0.0340, -0.0165]])), ('fc.bias', tensor([-0.0652,  0.0583]))])
OrderedDict([('fc.weight', tensor([[ 0.0166,  0.0110],
        [ 0.0340, -0.0165]])), ('fc.bias', tensor([-0.0652,  0.0583]))])
```
What makes the difference? The implementation of `load_state_dict()` can be roughly understood as copying each corresponding tensor in place: `param_x.copy_(tensor_x)`. This is different from the ordinary shallow copy `param_x = tensor_x`, which leads to `id(param_x) == id(tensor_x)`; and it is also different from the so-called deep copy `param_x = copy.deepcopy(tensor_x)` for a general Python object, e.g., a list. Specifically, deep-copying a list means allocating another memory space to hold the same values as the right-value (i.e., the source list).
In contrast, a Tensor object has an attribute that records the memory address of the stored values, and `copy_()` just passes the addresses of the left-value and the right-value to the backend (e.g., the CUDA API for GPU tensors) to copy the stored values.
Thus, without the command `a = copy.deepcopy(a)`, the tensors in `a` have the same addresses as the parameters in `m`, since `a` is acquired by `state_dict()`. After `load_state_dict()` (which uses `copy_()`), the parameters of `m` and the tensors in `a` still point to the same memory space, making any change in `a`'s entries observable from `m`'s entries.
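A minimal illustration of the distinction with plain tensors (the same logic applies to module parameters):

```python
import torch

a = torch.zeros(3)
b = torch.ones(3)
a_alias = a          # ordinary shallow copy: a_alias is the same object as a
a.copy_(b)           # in-place copy: writes b's values into a's storage
print(a_alias)       # tensor([1., 1., 1.]) -- the alias sees the change
print(a_alias is a)  # True
```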
Conceptually, we use `__new__` when we need to control the creation of a new instance, and `__init__` when we need to control the initialization of a new instance.
```python
# Python program to
# demonstrate __new__

# don't forget the object specified as base
class A(object):
    def __new__(cls):
        print("Creating instance")
        print(cls)
        result = super(A, cls).__new__(cls)
        print(result)
        return result

    def __init__(self):
        print("Init is called")


obj = A()
print(obj)
```
```
Creating instance
<class '__main__.A'>
<__main__.A object at 0x7f67cadd2748>
Init is called
<__main__.A object at 0x7f67cadd2748>
```
```python
class Logger(object):
    def __new__(cls, *args, **kwargs):
        if not hasattr(cls, '_logger'):
            # note: object.__new__ accepts no extra arguments,
            # so do not forward *args/**kwargs here
            cls._logger = super(Logger, cls).__new__(cls)
        return cls._logger
```
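A quick check of the intended singleton behavior:

```python
x = Logger()
y = Logger()
assert x is y   # every call returns the same instance
```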
Local structure depends on graph size, e.g., in \(G(n, p)\), the expected degree is \(np\), hence fixing \(p\) and increasing \(n\) changes the local structure of the graph.
To characterize the local structure, use graph d-patterns (Def. 4.1 and Fig. 2). A d-pattern is defined recursively and is motivated by the WL test (and the analysis in GIN); e.g., the 1-pattern of a node is its degree, and its 2-pattern is a pair: (its degree, the multiset of its neighbors' degrees).
To connect GNNs with d-patterns: on one side, a GNN can be programmed to output any value on any d-pattern independently (the more precise version is Thm. 4.3). Conversely, Thm. 4.2: a d-layer GNN (with an additional node-level output) will output the same results for nodes with the same d-patterns. Combining them, we can independently control the (output) values of d-layer GNNs on the set of d-patterns, and these values completely determine the GNN's output.
The relation between size generalization and d-pattern discrepancy (Thm. 5.1): let \(P_1, P_2\) be finitely supported distributions of graphs, and \(P_1^d, P_2^d\) the distributions of d-patterns over \(P_1, P_2\), respectively. Assume that any graph in \(P_2\) contains a node with a d-pattern in \(P_2^d \setminus P_1^d\). Then, for any regression task solvable by a d-depth GNN, there exists a (d+3)-depth GNN that solves the task on \(P_1\) but errs on \(P_2\).
To improve size generalization: consider the domain adaptation setting where we have access to labeled samples from the source domain but the target domain is unlabeled. SSL: propose a node-regression pretext task, that is, to regress the histogram of each feature field, where the histogram is aggregated at the focused node from its d-depth tree (d-hop neighborhood).