
S3 Upload Fails with "The XML you provided was not well-formed or did not validate against our published schema" #6849

Open
arthikek opened this issue Jan 26, 2025 · 3 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@arthikek

Checkboxes for prior research

Describe the bug

When uploading a large file (~500 MB) to S3 using the @aws-sdk/lib-storage package, the upload fails with the error:

The XML you provided was not well-formed or did not validate against our published schema

The issue appears to occur during a multipart upload, specifically when a Readable stream is used as the file's body. Smaller files upload successfully, and the problem seems related to the interaction between the provided stream and the configured partSize. When I set the part size above 500 MB (so the whole file fits in a single part), the upload works without any problem.

/**
 * Upload a file to S3 in buffered chunks (50 MB parts) with tagging.
 */
export async function uploadFileToS3(
  bucket: string,
  file: FileDTO,
  tags?: S3UploadTags,
) {
  const functionName = "uploadFileToS3";
  const s3Client = S3CLIENT_NEW;
  log.info(`[${functionName}] Uploading file to S3: ${file.displayName}`);

  const stream: ReadableStream<Uint8Array> = await getPDFFromExternalStorage(file.url, file.uuid);

  const finalTags: S3UploadTags = tags || {
    ConversionStatus: ConversionStatusEnum.PENDING,
    OCRStatus: OCRStatusEnum.SKIPPED,
  };

  const tagSet = Object.entries(finalTags).map(([Key, Value]) => ({
    Key,
    Value,
  }));


  console.log("Uploading file to S3 with file size: ", file.size);

  if (!stream) {
    throw new Error("Stream is empty");
  }

  const uploadParamsPDF = {
    Bucket: bucket,
    Key: file.uuid,
    Body: stream,
    ContentType: file.contentType,
    ContentLength: file.size,
    Metadata: {
      title: file.displayName,
    },
    Tagging: tagSet.map((tag) => `${tag.Key}=${tag.Value}`).join("&"),
  };

  const uploadToS3 = new Upload({
    client: s3Client,
    params: uploadParamsPDF,
    queueSize: 4,
    partSize: 50 * 1024 * 1024 // 50MB
  });


  uploadToS3.on("httpUploadProgress", (progress) => {
    log.debug(`Progress event: ${progress.loaded}/${progress.total} bytes`);
  });

  uploadToS3.addListener("error", (err) => {
    log.error(`Upload error: ${err}`);
  });

  console.log("Uploading file to S3");
  const result = await uploadToS3.done();
  return result;
}

Regression Issue

  • Select this option if this issue appears to be a regression.

SDK version number

@aws-sdk/[email protected]

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v23.5.0

Reproduction Steps

import { Readable } from "node:stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

const s3Client = new S3Client({ region: "us-east-1" });


export async function uploadFileToS3(bucket, file) {
  // Simulate a Readable stream (replace this with your actual stream source).
  // Note: this pushes the entire 500 MB payload as a single chunk.
  const stream = new Readable({
    read() {
      this.push(Buffer.alloc(1024 * 1024 * 500)); // one 500 MB zero-filled chunk
      this.push(null); // end of stream
    },
  });

  if (!stream) {
    throw new Error("Stream is empty");
  }

  const uploadParams = {
    Bucket: bucket,
    Key: file.uuid,
    Body: stream, // The stream to upload
    ContentType: "application/pdf", // Simplified content type
    ContentLength: file.size, // Expected file size
  };

  const upload = new Upload({
    client: s3Client,
    params: uploadParams,
    queueSize: 4,
    partSize: 5 * 1024 * 1024, 
  });

  upload.on("httpUploadProgress", (progress) => {
    console.log(`Progress: ${progress.loaded}/${progress.total} bytes`);
  });

  try {
    console.log("Uploading file to S3...");
    const result = await upload.done();
    console.log("Upload complete:", result);
  } catch (err) {
    console.error("Upload error:", err);
  }
}
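
For context on the "part size over 500 MB works" observation: S3 multipart uploads have documented limits of at least 5 MiB per part (except the last) and at most 10,000 parts per upload. A small helper along these lines (my own sketch, not part of the SDK) picks a partSize that satisfies both constraints for a given file size:

```javascript
// S3 multipart upload limits (per the Amazon S3 documentation):
// each part except the last must be >= 5 MiB, and an upload may
// contain at most 10,000 parts.
const MIN_PART_SIZE = 5 * 1024 * 1024; // 5 MiB
const MAX_PARTS = 10000;

// Pick the smallest compliant part size for a file of the given size.
function choosePartSize(fileSizeBytes) {
  const minForPartCount = Math.ceil(fileSizeBytes / MAX_PARTS);
  return Math.max(MIN_PART_SIZE, minForPartCount);
}
```

For a 500 MB file this returns the 5 MiB minimum (about 100 parts), so the partSize values used in this issue are well within the limits; the helper only matters for files above roughly 48 GiB.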

Observed Behavior

[handleConversion] pUCV6phH120ZruXxV3uzqXzBvQgWHwnGcVK5hrGm: Error during conversion process The XML you provided was not well-formed or did not validate against our published schema
Error: The XML you provided was not well-formed or did not validate against our published schema
    at Worker.<anonymous> (file:///app/dist/alexandria/server.js:1061:16)
    at Worker.emit (node:events:507:28)
    at MessagePort.<anonymous> (node:internal/worker:267:53)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:827:20)
    at MessagePort.<anonymous> (node:internal/per_context/messageport:23:28)
    at MessagePort.callbackTrampoline (node:internal/async_hooks:130:17)

Expected Behavior

I expected the file to be uploaded in chunks and assembled by S3.

Possible Solution

No response

Additional Information/Context

No response

@arthikek arthikek added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 26, 2025

arthikek commented Jan 27, 2025

Hi, updating the package fixed the upload. However, we don't get to see progress.total; it is undefined even after specifying the content length.

@aBurmeseDev aBurmeseDev self-assigned this Jan 27, 2025
@aBurmeseDev
Member

Hi @arthikek - thanks for reaching out. It sounds like the issue was resolved by the package/version update? Could you elaborate on what you found?

For upload progress, are you able to see progress.loaded? The missing total may be due to how ContentLength is handled in your use case when working with a stream. You may try using a Buffer instead of a stream to see if you get progress.total. Let us know.

Here's our docs for reference: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-lib-storage/

@aBurmeseDev aBurmeseDev added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Jan 29, 2025
@arthikek
Author

Hi, thanks for helping me out. I chose to use the ContentLength provided from our database, so it's fine for now. Updating the package did the job, and it's working fine. Another question: I am uploading a 500 MB file with this setup to S3, and I noticed CPU spikes up to 50% in my Docker container during that upload. Does this have to do with the encryption happening during the multipart upload?

Is this completely normal given we are using Node?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 30, 2025