fix(server): preserve spaces in AI summary stream by removing .trim#149
fix(server): preserve spaces in AI summary stream by removing .trim#149
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where AI-generated summaries, streamed from the Gemini API, were losing spaces between words. The problem stemmed from the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request simplifies the processing of LLM-generated text by removing a conditional trim operation and consistently stripping "html" and "" prefixes/suffixes. A critical security concern was raised regarding the summarizeStream function, which emits unsanitized HTML content from the LLM. This poses a high-severity Cross-Site Scripting (XSS) vulnerability, as malicious HTML from the LLM could lead to stored XSS if rendered directly by the client. Server-side HTML sanitization is recommended to mitigate this risk.
| text.stripPrefix("```html").stripSuffix("```") | ||
| } | ||
| .filter(_.nonEmpty) | ||
| .map(SummaryEvent.Content(_)) |
There was a problem hiding this comment.
The summarizeStream function generates summaries from LLM output, which is explicitly instructed to be in HTML format. The raw HTML output from the LLM is then encapsulated within a SummaryEvent.Content object and emitted. There is no server-side sanitization of this HTML content to prevent potential Cross-Site Scripting (XSS) vulnerabilities. Although the prompt instructs the LLM to "Never use <script> tags", relying solely on LLM instructions for security is insufficient. If the LLM were to generate malicious HTML (e.g., containing <script> tags or other XSS payloads), and this SummaryEvent.Content is rendered directly on the client-side without proper sanitization, it could lead to stored XSS.
Remediation: Implement robust HTML sanitization on the LLM's output before it is sent to the client. This should be done server-side using a well-maintained HTML sanitization library (e.g., OWASP Java HTML Sanitizer for Scala/Java applications). The sanitization should remove any potentially malicious tags, attributes, or JavaScript. The client-side rendering should also treat this content as untrusted and apply appropriate escaping or sanitization if it's not already guaranteed to be safe by the server.
This PR fixes the issue where spaces were missing between words in the AI summary stream. The issue was caused by calling
.trimon each individual chunk received from the Gemini API stream, leading to merged words when chunks were concatenated on the client.Key changes:
.trim()from the summary chunks inSummarizeService.scala..stripPrefixand.stripSuffixto remove markdown wrappers without affecting text content.