Questions tagged [distributed-computing]

Utilizing more than one computer, connected to each other with a communication link to accomplish a common task.

0 votes
0 answers
14 views

Reliable protocol that guarantees complete delivery, with no in-order promise

The sender is sending N packets to a receiver. I want a protocol or method that guarantees delivery: each packet is received at least once. It is OK if some packets are received more than once due to ...
Yufei Zheng
-2 votes
0 answers
21 views

How to Seamlessly Distribute Application Processing Between Two Windows Computers Based on Resource Availability? [closed]

I'm looking to create a distributed computing setup using two Windows 10 computers, where applications can run on either computer based on resource availability. The goal is for users to interact with ...
Odd Magnus Grinder
0 votes
0 answers
19 views

Does ZooKeeper preserve order when moving sessions?

The ZooKeeper book says: When a client creates a ZooKeeper handle using a specific language binding, it establishes a session with the service. The client initially connects to any server in ...
codefast
0 votes
0 answers
28 views

How to securely conduct lottery-like draws with guaranteed randomness without auditing?

Is there an existing algorithm or method to conduct lottery-like draws that ensures secure and truly random results without the need for auditing? Is there any library to do this? I searched the web ...
aguiadouro
0 votes
2 answers
32 views

Unable to run code on Multiple GPUs in PyTorch - Usage shows only 1 GPU is being utilized

I am training a Transformer Encoder-Decoder based model for Text summarization. The code works without any errors but uses only 1 GPU when checked with nvidia-smi. However, I want to run it on all the ...
Abid Meraj
0 votes
0 answers
18 views

Out-of-memory problem when using dist.all_gather

I'm writing code for multi-GPU training, and I need to gather embeddings from different GPUs to calculate the loss and then propagate the gradients back to the different GPUs. However, when the program runs ...
Drack Young
0 votes
0 answers
24 views

I Have Imagination of Futuristic Computing Scenarios. How Can I Get Involved?

I am currently a backend developer specializing in Java-oriented web services, with a bachelor's degree in Computer Science. After working for a few years, I have become deeply interested in diving ...
alex y
0 votes
0 answers
25 views

Distributed training model produces NaN values during inference

When I trained my Mamba model on 4 GPUs through DistributedDataParallel, after the first round of training, I executed the validation code. The validation on the cuda:3 process always gave NaN values, and ...
Ezra Corli
0 votes
0 answers
38 views

Distributed Training using PyTorch

I am using PyTorch's multiprocessing framework to distribute my training across multiple GPUs. I'm doing this over the batch size, so each GPU has its independent batch that it calculates the gradient ...
Gummy bears
1 vote
2 answers
99 views

How to reliably implement the fan-out write pattern?

I'm trying to RELIABLY implement that pattern. For practical purposes, assume we have something similar to a Twitter clone (in Cassandra and Node.js). So, user A has 500k followers. When user A posts a ...
InglouriousBastard
0 votes
0 answers
17 views

How to Deploy Replicaset and Custom Images in AWS via Ray Docker Images?

I'm getting started with Ray on an AWS cluster and trying to understand the declarative YAML config, as in the Ray GitHub repository. I can see it is possible to directly add the Docker images of Ray on the AWS EC2 ...
Della
1 vote
1 answer
72 views

Using torchrun with AWS SageMaker estimator on multi-GPU node

I would like to run a training job on an ml.p4d.24xlarge machine on AWS SageMaker. I ran into a similar issue described here with significant slowdowns in training time. I understand now that I should run ...
probably45
2 votes
0 answers
14 views

Why am I getting a "LM_WRITE_LOG_FAILED ERROR 80000" in GridDB when writing to the log file?

I'm using GridDB for managing a distributed database system and recently encountered the following error while trying to perform operations: 80000 LM_WRITE_LOG_FAILED ERROR Writing to log file failed. ...
omar esawy
2 votes
0 answers
11 views

Why am I getting a "JC_CONTAINER_NOT_OPENED ERROR 145034" in GridDB when performing operations on a container?

I'm working with GridDB to manage a distributed database and recently encountered the following error while performing operations on a container: 145034 JC_CONTAINER_NOT_OPENED ERROR Status check of ...
omar esawy
0 votes
0 answers
26 views

How do microservices communicate with each other when they are secured with JWT?

I am currently learning microservices architecture. I got to know that you can use JWT, OAuth, and a bunch of other mechanisms to secure microservices, but one thing that confuses me is how they ...
Yasin Ahmed