  • Journal name:

    Parallel Computing

  • Chinese name: 并行计算
  • Publication frequency: 1.125
  • ISSN: 0167-8191
  • Publisher: -
  • GPU-based parallel multi-objective particle swarm optimization for large swarms and high-dimensional problems
    Abstract: Over the last few years, parallel MOPSO (Multi-objective Particle Swarm Optimization) with two or more objectives has gained a lot of attention in the literature on GPU computing. A number of MOPSO implementations on a GPU have been published. However, none of them has been able to capture good enough Pareto fronts quickly. In addition, the authors have pointed out limitations of their implementations in various respects, such as archive handling and picking up fewer nondominated solutions. Previous literature also lacks evaluation of MOPSO implementations with large swarms and high-dimensional problems. This paper presents a faster implementation of parallel MOPSO on a GPU based on the CUDA architecture. We achieved our faster implementation by using coalesced memory access, a fast pseudorandom number generator, the Thrust library, the CUB library, an atomic function, parallel archiving and so on. The proposed parallel implementation of MOPSO using a master-slave model provides up to 157 times speedup compared to the corresponding CPU implementation. As the proposed implementation performs very well even with increased problem dimensionality and swarm population, it can be widely used in real-world optimization problems. (C) 2019 Elsevier B.V. All rights reserved.
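The archive handling mentioned in this abstract rests on Pareto dominance. The following is a minimal sequential sketch of a nondominated archive update, assuming all objectives are minimized. The names `dominates` and `update_archive` are hypothetical, and the code does not reproduce the paper's CUDA implementation or its parallel archiving.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive: List[Point], candidate: Point) -> List[Point]:
    """Offer one objective vector to the archive and return the new nondominated set."""
    # Reject the candidate if an archived point dominates it or equals it.
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive
    # Otherwise drop every archived point that the candidate dominates.
    survivors = [kept for kept in archive if not dominates(candidate, kept)]
    survivors.append(candidate)
    return survivors


# Illustrative two-objective values (made up for this sketch).
archive: List[Point] = []
for point in [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (0.5, 6.0), (2.0, 1.0)]:
    archive = update_archive(archive, point)
```

Each update costs time linear in the archive size, which is one reason archive maintenance is a natural target for parallelization.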
  • AMG based on compatible weighted matching for GPUs
    Abstract: We describe the main issues and design principles of an efficient implementation, tailored to recent generations of Nvidia Graphics Processing Units (GPUs), of an Algebraic MultiGrid (AMG) preconditioner previously proposed by one of the authors and already available for standard CPUs in the open-source package BootCMatch (Bootstrap algebraic multigrid based on Compatible weighted Matching). The AMG method relies on a new approach for coarsening sparse symmetric positive definite (s.p.d.) matrices, named coarsening based on compatible weighted matching. It exploits maximum weight matching in the adjacency graph of the sparse matrix, driven by the principle of compatible relaxation, providing a suitable aggregation of unknowns which goes beyond the limits of the usual heuristics applied in current methods. We adopt an approximate solution of the maximum weight matching problem, based on a recently proposed parallel algorithm, referred to as the Suitor algorithm, and show that it allows us to obtain good-quality coarse matrices for our AMG on GPUs. We exploit the inherent parallelism of modern GPUs in all the kernels involving sparse matrix computations, both for the setup of the preconditioner and for its application in a Krylov solver, outperforming preconditioners available in the original sequential CPU code as well as the single-node Nvidia AmgX library. Results for a large set of linear systems arising from the discretization of scalar and vector partial differential equations (PDEs) are discussed. (C) 2019 Elsevier B.V. All rights reserved.
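The abstract names the Suitor algorithm for approximate maximum weight matching. Below is a sequential sketch of the Suitor idea as it is commonly described: each vertex proposes to its heaviest neighbor that does not already hold a better offer, and a displaced suitor proposes again. It assumes positive, distinct edge weights and a symmetric adjacency dictionary. The function name `suitor_matching` is hypothetical, and this is not the GPU kernel used in the paper.

```python
from typing import Dict, Optional, Set, Tuple


def suitor_matching(adj: Dict[int, Dict[int, float]]) -> Set[Tuple[int, int]]:
    """Approximate maximum weight matching; adj[u][v] is the weight of edge (u, v)."""
    suitor: Dict[int, Optional[int]] = {v: None for v in adj}  # best proposer seen by v
    offer: Dict[int, float] = {v: 0.0 for v in adj}            # weight of that proposal
    for start in adj:
        current: Optional[int] = start
        while current is not None:
            # Find the heaviest neighbor whose current offer we can beat.
            partner, best = None, 0.0
            for neighbor, weight in adj[current].items():
                if weight > best and weight > offer[neighbor]:
                    partner, best = neighbor, weight
            displaced = None
            if partner is not None:
                displaced = suitor[partner]  # this vertex loses its proposal
                suitor[partner] = current
                offer[partner] = best
            current = displaced              # a displaced vertex proposes again
    # An edge is matched when both endpoints are each other's suitor.
    return {(u, v) for v, u in suitor.items()
            if u is not None and u < v and suitor[u] == v}


# Path graph 0-1-2-3 with illustrative weights 1, 3, 2: the middle edge wins.
path = {0: {1: 1.0}, 1: {0: 1.0, 2: 3.0}, 2: {1: 3.0, 3: 2.0}, 3: {2: 2.0}}
matching = suitor_matching(path)
```

In a multigrid setting, each matched pair would be aggregated into one coarse unknown; that step is outside this sketch.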
  • LU-Cholesky QR algorithms for thin QR decomposition
    Abstract: This paper proposes LU-Cholesky QR algorithms for thin QR decomposition (also called economy-size or reduced QR decomposition). CholeskyQR is known as a fast algorithm for thin QR decomposition, and CholeskyQR2 aims to improve the orthogonality of a Q-factor computed by CholeskyQR. Although such Cholesky QR algorithms can be implemented efficiently in high-performance computing environments, they are not applicable to ill-conditioned matrices, in contrast to the Householder QR and Gram-Schmidt algorithms. To address this problem, we apply the concept of LU decomposition to the Cholesky QR algorithms, i.e., the idea is to use the LU-factors of a given matrix as preconditioning before applying Cholesky decomposition. Moreover, we present a rounding error analysis of the proposed algorithms with respect to the orthogonality and residual of the computed QR-factors. Numerical examples provided in this paper illustrate the efficiency of the proposed algorithms in parallel computing on both shared and distributed memory computers. (C) 2019 Elsevier B.V. All rights reserved.
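As background for this abstract, here is a small pure-Python sketch of plain CholeskyQR and CholeskyQR2 on a dense tall matrix stored as a list of rows. It assumes a well-conditioned, full-column-rank input and does not include the LU-based preconditioning that the paper proposes. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gram(a: Matrix) -> Matrix:
    """Gram matrix A^T A of a tall matrix A."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]


def cholesky_upper(g: Matrix) -> Matrix:
    """Upper triangular R with R^T R = g; sqrt fails if g is not numerically positive definite."""
    n = len(g)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = g[i][j] - sum(r[k][i] * r[k][j] for k in range(i))
            r[i][j] = math.sqrt(s) if i == j else s / r[i][i]
    return r


def solve_right_upper(a: Matrix, r: Matrix) -> Matrix:
    """Solve Q R = A for Q, one row at a time, by substitution."""
    n = len(r)
    q = []
    for row in a:
        q_row = [0.0] * n
        for j in range(n):
            q_row[j] = (row[j] - sum(q_row[k] * r[k][j] for k in range(j))) / r[j][j]
        q.append(q_row)
    return q


def cholesky_qr(a: Matrix) -> Tuple[Matrix, Matrix]:
    r = cholesky_upper(gram(a))
    return solve_right_upper(a, r), r


def cholesky_qr2(a: Matrix) -> Tuple[Matrix, Matrix]:
    """Run CholeskyQR twice; the second pass repairs the orthogonality of Q."""
    q1, r1 = cholesky_qr(a)
    q, r2 = cholesky_qr(q1)
    return q, matmul(r2, r1)


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q, r = cholesky_qr2(a)
```

Forming the Gram matrix squares the condition number of the input, which is why the plain algorithm breaks down on ill-conditioned matrices and a preconditioning step becomes attractive.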
  • Comparison of selected FETI coarse space projector implementation strategies
    • Journal: Parallel Computing
    • Issue: May 2020
    Abstract: This paper deals with scalability improvements of the FETI (Finite Element Tearing and Interconnecting) domain decomposition method for solving elliptic PDEs. The main bottleneck of FETI is the solution of a coarse problem that is part of the projector onto the natural coarse space. This paper introduces and compares two strategies for the FETI coarse problem solution. The first one is a classical solution with either direct solvers (factorization + forward/backward substitutions) or iterative solvers (conjugate gradient and deflated conjugate gradient methods). The second one is the assembly of an explicit inverse using a direct solver, with the coarse problem solution realised by dense matrix-vector products. MPI subcommunicators are employed to increase arithmetic intensity and, crucially, to decrease the communication cost. The PERMON library for quadratic programming, which implements the Total FETI variant of FETI, was used for the numerical experiments. (C) 2020 Elsevier B.V. All rights reserved.
  • Efficient CGM-based parallel algorithms for the longest common subsequence problem with multiple substring-exclusion constraints
    • Journal: Parallel Computing
    • Issue: Mar. 2020
    Abstract: A variant of the Longest Common Subsequence (LCS) problem is the LCS problem with multiple substring-exclusion constraints (M-STR-EC-LCS), which is of great importance in many fields, especially in bioinformatics. This problem consists of computing the LCS of two strings X and Y, of length n and m respectively, that excludes a set of d constraints P = {P_1, P_2, ..., P_d} of total length r. Recently, Wang et al. proposed a sequential solution based on dynamic programming that requires O(nmr) execution time and space. To the best of our knowledge, there are no parallel solutions for this problem. This paper describes new efficient parallel algorithms on the Coarse Grained Multicomputer (CGM) model to solve this problem. Firstly, we propose a multi-level Directed Acyclic Graph (DAG) that determines the correct evaluation order of sub-problems in order to avoid redundancy due to overlap. Secondly, we propose two CGM parallel algorithms based on our DAG. The first algorithm is based on a regular partitioning of the DAG and requires O(nmr/p) execution time with O(p) communication rounds, where p is the number of processors used. Its main drawback is high processor idle time: because of the dependencies between the nodes in the DAG, many processors become idle over time. The second algorithm uses an irregular partitioning of the DAG that minimizes this idle time by allowing the processors to stay active as long as possible. It requires O(nmr/p) execution time with O(kp) communication rounds, where k is a constant integer that configures the irregular partitioning. Both algorithms require O(r|Σ|/p) preprocessing time, where |Σ| is the size of the alphabet. The experimental results show good agreement with the theoretical predictions. (C) 2019 Elsevier B.V. All rights reserved.
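The dependency structure that makes the DAG evaluation order matter can be seen in the unconstrained LCS recurrence. The sketch below fills the classic LCS table one anti-diagonal at a time. Cells on the same anti-diagonal depend only on earlier anti-diagonals, so a parallel version could compute them independently. It does not handle the substring-exclusion constraints or the CGM partitioning described in the abstract, and the function name is hypothetical.

```python
def lcs_length_wavefront(x: str, y: str) -> int:
    """Length of the longest common subsequence, computed anti-diagonal by anti-diagonal."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    # Anti-diagonal d holds the cells (i, j) with i + j == d, for 1 <= i <= n and 1 <= j <= m.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


length = lcs_length_wavefront("ABCBDAB", "BDCABA")
```

Anti-diagonals are short near the table's corners and long in the middle, so a regular partitioning leaves processors idle at the start and end of the sweep.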
  • Guest editorial: Special issue on applications and system software for hybrid exascale systems
    • Journal: Parallel Computing
    • Issue: Mar. 2020
  • On the scalability of a CFD tool for supersonic jet flow configurations
    • Journal: Parallel Computing
    • Issue: May 2020
    Abstract: New regulations are imposing noise emission limits on the aviation industry, which is pushing researchers and engineers to invest effort in studying aeroacoustic phenomena. Following this trend, an in-house computational fluid dynamics tool is built to reproduce high-fidelity results of supersonic jet flows for aeroacoustic analogy applications. The solver is written using the large eddy simulation formulation, discretized using a finite difference approach and an explicit time integration. Numerical simulations of supersonic jet flows are very expensive and demand efficient high-performance computing. Therefore, non-blocking Message Passing Interface protocols and parallel input/output features are implemented in the code in order to perform simulations that demand up to one billion grid points. The present work evaluates the code improvements along with the computational performance of the solver running on a computer with a maximum theoretical peak of 2.727 PFlops. Different mesh configurations, whose sizes vary from a few hundred thousand to approximately one billion grid points, are evaluated in the present paper. Calculations are performed using different workloads in order to assess the strong and weak scalability of the parallel computational tool. Moreover, validation results for a realistic flow condition are also presented. (C) 2020 Elsevier B.V. All rights reserved.
  • An efficient algorithm for match pair approximation in message passing
    • Journal: Parallel Computing
    • Issue: Mar. 2020
    Abstract: The asynchronous message passing paradigm is commonly used in high performance computing (HPC). Message non-determinism makes error detection in message passing programs very difficult. Prior work uses an over-approximation of the precise match pair records (each is a pair of a send and a receive that may potentially match at runtime) to capture all possible message communication in a concurrent trace program (CTP). Symbolic model checking with such a set of match pairs is able to witness program properties including deadlock, message race, and zero-buffer compatibility. However, the approach is inefficient because there are exponentially many ways to resolve match pairs, most of which are infeasible for property witnessing. This paper presents an effective under-approximation algorithm that shrinks the generated match pair set, thus pruning most infeasible match pairs. The algorithm first sections each process in a CTP such that each potential sender distributes roughly a bounded number of sends to match the same number of receives in the process, and then approximates the match pairs for the sends and receives in each section independently using a few simple rules with ranking. The novelty of the work is that the algorithm has the flexibility to generate match pair sets of various sizes based on the user input. This paper further shows that the precise match pairs for any CTP can be generated with a bounded input. Experiments over a set of benchmarks show that symbolic model checking with the algorithm in this paper outperforms state-of-the-art tools: the runtime of property witnessing is drastically reduced, as all the properties are witnessed with a small set of match pairs generated by the new algorithm. The results also show that the algorithm is able to scale to a program that employs a high degree of message non-determinism and/or a high degree of deep communication. (C) 2019 Elsevier B.V. All rights reserved.
  • Parallel selection on GPUs
    • Journal: Parallel Computing
    • Issue: Mar. 2020
    Abstract: We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always leveraging the respectively available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our Sample-Select algorithm is comparable to that of specialized algorithms designed for, and exploiting the characteristics of, "pleasant" data distributions. At the same time, as the proposed Sample-Select algorithm does not work on the actual element values but only on the element ranks, it is robust to the input data and can complete significantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy. (C) 2019 Elsevier B.V. All rights reserved.
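The general idea of sampling-based selection can be sketched sequentially: draw a random sample, take splitters from it, distribute the elements into buckets by comparison, and continue only in the bucket that contains the target rank. The code below is a hedged illustration of that idea, not the paper's GPU algorithm. The function name, the bucket count, the oversampling factor of 4, and the base-case size are all assumptions.

```python
import random
from bisect import bisect_right
from typing import List, Optional, Sequence


def sample_select(data: Sequence[float], rank: int, num_buckets: int = 8,
                  base_case: int = 32, rng: Optional[random.Random] = None) -> float:
    """Return the element at index `rank` of sorted(data) without sorting everything."""
    if not 0 <= rank < len(data):
        raise IndexError("rank out of range")
    rng = rng or random.Random(0)
    items: List[float] = list(data)
    while len(items) > base_case:
        # Splitters come from a sorted random sample; only comparisons are used.
        sample = sorted(rng.sample(items, min(len(items), 4 * num_buckets)))
        step = max(1, len(sample) // num_buckets)
        splitters = sorted(set(sample[step::step]))
        buckets: List[List[float]] = [[] for _ in range(len(splitters) + 1)]
        for value in items:
            buckets[bisect_right(splitters, value)].append(value)
        # Walk the ordered buckets until the one holding the target rank is found.
        for bucket in buckets:
            if rank < len(bucket):
                break
            rank -= len(bucket)
        if len(bucket) == len(items):
            break  # no progress (e.g. many duplicates): fall back to sorting
        items = bucket
    return sorted(items)[rank]


generator = random.Random(1)
values = [generator.randrange(500) for _ in range(2000)]
median = sample_select(values, 1000)
```

Because the splitters come from the data itself, the bucket sizes depend on ranks, not on how the values are spread numerically.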
  • Cannon-type triangular matrix multiplication for the reduction of generalized HPD eigenproblems to standard form
    • Journal: Parallel Computing
    • Issue: Mar. 2020
    Abstract: We first develop a new variant of Cannon's algorithm for parallel matrix multiplication on rectangular process grids. Then we tailor it to selected situations where at least one triangular matrix is involved, namely "upper triangle of (full × upper triangular)," "lower triangle of (lower triangular × upper triangular)," and "all of (upper triangular × rectangular)." These operations arise in the transformation of generalized Hermitian positive definite eigenproblems $AX = BX\Lambda$ to standard form $\tilde{A}\tilde{X} = \tilde{X}\Lambda$, and making use of the triangular structure enables savings in arithmetic operations and communication. Numerical results show that the new implementations outperform previously available routines, and they are particularly effective if a whole sequence of generalized eigenproblems with the same matrix B must be solved, but they can also be competitive for the solution of a single generalized eigenproblem. (C) 2019 Elsevier B.V. All rights reserved.
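For readers unfamiliar with the baseline, the sketch below simulates the classical Cannon algorithm on a square q × q process grid in plain Python. After an initial skew, every simulated process multiplies its local blocks, then A blocks shift one step left and B blocks one step up. This is the textbook square-grid version only. The rectangular-grid variant and the triangular specializations from the paper are not reproduced, and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def cannon_matmul(a: Matrix, b: Matrix, q: int) -> Matrix:
    """Multiply two n x n matrices by simulating Cannon's algorithm on a q x q grid."""
    n = len(a)
    assert n % q == 0, "matrix size must be divisible by the grid size"
    bs = n // q  # block size owned by each simulated process

    def block(m: Matrix, i: int, j: int) -> Matrix:
        return [row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]

    # Initial skew: process (i, j) holds A(i, i+j) and B(i+j, j), indices mod q.
    a_loc = [[block(a, i, (i + j) % q) for j in range(q)] for i in range(q)]
    b_loc = [[block(b, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c_loc = [[[[0.0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Local multiply-accumulate on every process.
        for i in range(q):
            for j in range(q):
                for r in range(bs):
                    for k in range(bs):
                        for c in range(bs):
                            c_loc[i][j][r][c] += a_loc[i][j][r][k] * b_loc[i][j][k][c]
        # Circular shifts: A blocks move one step left, B blocks one step up.
        a_loc = [[a_loc[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b_loc = [[b_loc[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Gather the distributed result blocks into one matrix.
    result = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(bs):
                result[i * bs + r][j * bs:(j + 1) * bs] = c_loc[i][j][r]
    return result


left = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
right = [[float((i + 2 * j) % 5) for j in range(4)] for i in range(4)]
product = cannon_matmul(left, right, 2)
```

Each process only ever communicates with its grid neighbors, which is the property the triangular variants build on to skip blocks that are known to be zero.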
  • Programming languages for data-intensive HPC applications: A systematic mapping study
    • Journal: Parallel Computing
    • Issue: Mar. 2020
    Abstract: A major challenge in modelling and simulation is the need to combine expertise in both software technologies and a given scientific domain. When High-Performance Computing (HPC) is required to solve a scientific problem, software development becomes a problematic issue. Considering the complexity of the software for HPC, it is useful to identify programming languages that can be used to alleviate this issue. Because the existing literature on the topic of HPC is very dispersed, we performed a Systematic Mapping Study (SMS) in the context of the European COST Action cHiPSet. This literature study maps characteristics of various programming languages for data-intensive HPC applications, including category, typical user profiles, effectiveness, and type of articles. We organised the SMS in two phases. In the first phase, relevant articles were identified by an automated keyword-based search in eight digital libraries. This led to an initial sample of 420 papers, which was then narrowed down in a second phase, by human inspection of article abstracts, titles and keywords, to 152 relevant articles published in the period 2006-2018. The analysis of these articles enabled us to identify 26 programming languages referred to in 33 of the relevant articles. We compared the outcome of the mapping study with the results of our questionnaire-based survey, which involved 57 HPC experts. The mapping study and the survey revealed that the desired features of programming languages for data-intensive HPC applications are portability, performance and usability. Furthermore, we observed that the majority of the programming languages used in the context of data-intensive HPC applications are text-based general-purpose programming languages. Typically these have a steep learning curve, which makes them difficult to adopt. We believe that the outcome of this study will inspire future research and development in programming languages for data-intensive HPC applications. (C) 2019 Elsevier B.V. All rights reserved.
  • Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors
    Abstract: In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.
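The two building blocks named in this abstract, inversion of small diagonal blocks by Gauss-Jordan elimination and application of the inverses by matrix-vector products, can be sketched sequentially as follows. This sketch uses explicit row swaps for partial pivoting, which differs from the implicit pivoting strategy the paper describes, and it processes the blocks one after another instead of in a batched GPU kernel. Function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def gauss_jordan_inverse(block: Matrix) -> Matrix:
    """Invert a small dense block by Gauss-Jordan elimination with partial pivoting."""
    n = len(block)
    # Augment the block with the identity: [A | I] is reduced to [I | A^-1].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(block)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular block")
        aug[col], aug[pivot] = aug[pivot], aug[col]  # explicit row swap
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def block_jacobi_apply(inverses: List[Matrix], vector: List[float]) -> List[float]:
    """Apply the preconditioner: multiply each vector segment by its block inverse."""
    out: List[float] = []
    offset = 0
    for inv in inverses:
        size = len(inv)  # blocks may have different sizes
        segment = vector[offset:offset + size]
        out.extend(sum(inv[i][k] * segment[k] for k in range(size)) for i in range(size))
        offset += size
    return out


# Illustrative diagonal blocks of sizes 2 and 1 (values made up for this sketch).
blocks = [[[4.0, 1.0], [2.0, 3.0]], [[2.0]]]
inverses = [gauss_jordan_inverse(b) for b in blocks]
result = block_jacobi_apply(inverses, [1.0, 2.0, 4.0])
```

Since the blocks are independent, both loops over blocks are the natural place for batching.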
  • Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices
    Abstract: Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. (C) 2018 Elsevier B.V. All rights reserved.
  • Resilient computational applications using Coarray Fortran
    Abstract: With the increase in the number of hardware components and layers of the software stack in High Performance Computing (HPC), there will likely be an increase in the number of user-visible hardware and software failures. Even under the most optimistic assumptions about the reliability of individual components, probabilistic amplification from using millions of nodes has a dramatic impact on the Mean Time Between Failure (MTBF) of the entire platform. Although several techniques to address this problem have been developed, the support provided by the programming model for the user to mitigate or work around this issue is still insufficient. The Fortran 2018 standard defines failed images, a new feature that allows the programmer to detect and manage image failures in a parallel program. In this paper we show how to use failed images and teams, another feature defined in the Fortran 2018 standard, to implement resilient computational applications. (C) 2018 Elsevier B.V. All rights reserved.
  • A two-level computational graph method for the adjoint of a finite volume based compressible unsteady flow solver
    Abstract: The adjoint method is a useful tool for finding gradients of design objectives with respect to system parameters for fluid dynamics simulations. But the utility of this method is hampered by the difficulty of writing an efficient implementation of the adjoint flow solver, especially one that scales to thousands of cores. This paper demonstrates a Python library, called adFVM, that can be used to construct an explicit unsteady flow solver and derive the corresponding discrete adjoint flow solver using automatic differentiation (AD). The library uses a two-level computational graph method for representing the structure of both solvers. The library translates this structure into a sequence of optimized kernels, significantly reducing its execution time and memory footprint. Kernels can be generated for heterogeneous architectures including distributed memory, shared memory and accelerator-based systems. The library is used to write a finite volume based compressible flow solver. A wall clock time comparison between different flow solvers and adjoint flow solvers built using this library and state-of-the-art graph-based AD libraries is presented on a turbomachinery flow problem. Performance analysis of the flow solvers is carried out for CPUs and GPUs. Results of strong and weak scaling of the flow solver and its adjoint are demonstrated on subsonic flow in a periodic box. (C) 2018 Elsevier B.V. All rights reserved.
  • A geometric partitioning method for distributed tomographic reconstruction
    Abstract: Tomography is a powerful technique for 3D imaging of the interior of an object. With the growing sizes of typical tomographic data sets, the computational requirements for algorithms in tomography are rapidly increasing. Parallel and distributed-memory methods for tomographic reconstruction are therefore becoming increasingly common. An underexposed aspect is the effect of the data distribution on the performance of distributed-memory reconstruction algorithms. In this work, we introduce a geometric partitioning method, which takes into account the acquisition geometry and aims to minimize the necessary communication between nodes for distributed-memory forward projection and back projection operations. These operations are crucial subroutines for an important class of reconstruction methods. We show that the choice of data distribution has a significant impact on the run-time of these methods. With our novel partitioning method we reduce the communication volume drastically compared to straightforward distributions, by up to 90% for a number of cases, and furthermore we guarantee a specified load balance. (C) 2018 Elsevier B.V. All rights reserved.
  • Evaluation and modeling of the supercore parallelization pattern in automotive real-time systems
    Abstract: Today's combustion engines are finely tuned to deliver as much performance as possible from only a small amount of fuel. To achieve such high efficiency, a lot of computational power is needed in Engine Management Systems (EMSs), which nowadays is delivered by multicore processors. However, this is a challenge for software developers, as most of them are not yet familiar with the specifics of multicore programming. The real-time requirements of an EMS further complicate software development. This paper revisits the supercore embedded Parallel Design Pattern, which reduces the fork-join overhead by ensuring concurrent execution of coupled tasks. To test our pattern we implemented two algorithms using the supercore pattern on a state-of-the-art EMS with an Infineon Aurix TC39x processor. We show that by using the supercore pattern we were able to reduce the response time of the analyzed functions and to achieve a speedup of up to 1.97 on four cores. We also analyzed the effect of the non-uniform memory architecture, which required specific optimization measures and limits the achievable speedup to 2 on four cores. We also show how the supercore pattern is modeled by previously defined extensions to the Electronics Architecture and Software Technology - Architecture Description Language (EAST-ADL) and the AUTomotive Open System ARchitecture standard (AUTOSAR). (C) 2018 Published by Elsevier B.V.
  • Dynamic look-ahead in the reduction to band form for the singular value decomposition
    Abstract: We investigate the introduction of look-ahead in two-stage algorithms for the singular value decomposition (SVD). Our approach relies on a specialized reduction for the first stage that produces a band matrix with the same upper and lower bandwidth instead of the conventional upper triangular-band matrix. In the case of a CPU-GPU server, this alternative form accommodates a static look-ahead in the algorithm in order to overlap the reduction of the "next" panel on the CPU with the "current" trailing update on the GPU. For multicore processors, we leverage the same compact form to formulate a version of the algorithm that advances the reduction of "future" panels, yielding a dynamic look-ahead that overcomes the performance bottleneck that the sequential panel factorization represents. (C) 2018 Elsevier B.V. All rights reserved.
  • Planning for performance: Enhancing achievable performance for MPI through persistent collective operations
    Abstract: The advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, significant additional optimizations can be revealed through the use of persistence (planned transfers), which is not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design and prototype implementation of LibPNBC (based on LibNBC), and the MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example application (based on 3D conjugate gradient) illustrating the potential performance enhancements for such operations. Persistent operations enable MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent, nonblocking collective operations, strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., to support overlap of communication with computation and/or other communication). Further enhancement of the current reference implementation, as well as additional opportunities to enhance performance through the application of these new APIs, comprise future work. (C) 2018 Published by Elsevier B.V.
  • Online cost-rejection rate scheduling for resource requests in hybrid clouds
    Abstract: The hybrid cloud computing model has attracted considerable attention in recent years. Because of the security and controllability of private clouds, some resource requests must be scheduled in the private part of a hybrid cloud. However, these requests may be rejected because of the limited resources of the private part. In this paper, all resource requests are classified into two categories: special requests, which must be processed in the private cloud, and normal requests, which can be processed in either the private or the public cloud. Considering both the case where normal requests are dispatched undivided to one cloud and the case where they are divided and dispatched to different clouds, we propose the online cost-rejection rate scheduling strategy (OCS) for undivided normal requests and the online scheduling strategy for the case where some normal requests are divided (OCS-divided). Both make suitable request placement decisions in real time and minimize the cost of renting public cloud resources while keeping the rate of rejected requests low. We then transform both online models into two one-shot optimization problems by means of Lyapunov optimization techniques, and employ the optimal decay algorithm to solve the one-shot problems. The simulation results demonstrate that our scheduling strategies can achieve a trade-off between cost and rejection rate, and can process real-time resource requests in hybrid clouds. OCS-divided achieves an average cost saving of about 25% compared with OCS and maximizes resource utilization. (C) 2018 Elsevier B.V. All rights reserved.