Skip to content

IBX-11441: Provided proper index limit for the StringField for search engines#744

Open
barw4 wants to merge 11 commits into4.6from
ibx-11441-textblock-index-limit
Open

IBX-11441: Provided proper index limit for the StringField for search engines#744
barw4 wants to merge 11 commits into4.6from
ibx-11441-textblock-index-limit

Conversation

@barw4
Copy link
Copy Markdown
Contributor

@barw4 barw4 commented Apr 13, 2026

🎫 Issue IBX-11441

Description:

Solr and ES string fields are based on the Lucene architecture, therefore bytes limit for both is 32766.

In the case of TextBlock it never made sense to set the limit to 255, additionally we didn't have such a limit in TextLine.

For QA:

Documentation:

@barw4 barw4 self-assigned this Apr 13, 2026
@barw4 barw4 added Bug Something isn't working Ready for review labels Apr 13, 2026
@barw4 barw4 requested a review from a team April 14, 2026 07:21
Copy link
Copy Markdown
Contributor

@konradoboza konradoboza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't personally able to find any reasoning not to do this that way...
Image

Maybe @alongosz can give some historical explanation if there is any. 😉

@alongosz alongosz changed the title IBX-11441: Provide proper index limit for the StringField for search engines IBX-11441: Provided proper index limit for the StringField for search engines Apr 14, 2026
Copy link
Copy Markdown
Member

@alongosz alongosz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two possibilities come to mind:

  1. some early Solr version had different limit
  2. it was done to avoid inflating search engine, as the full text values were supposed to be supported by full text search fields only

I don't see in 2026 any reason to keep that old limit :)

* @param mixed $string
*/
private function extractShortText($string)
private function extractText($string): string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against changing the method name here, because what we are doing is we're extracting the first paragraph - and another "field" (see calling location) contains the whole text.

(that's what strtok does in this context)

Suggested change
private function extractText($string): string
private function extractShortText($string): string

or straight up declare what we're doing:

Suggested change
private function extractText($string): string
private function extractFirstParagraph($string): string

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think the method name is correct, but implementation is not. Imo we should store whole text, not just first paragraph by simply removing new lines. Wdyt @Steveb-p @konradoboza @alongosz?

);

// Enforce Lucene's bytes MAX_TERM_LENGTH to avoid silent indexing failures
return mb_strcut((string)$value, 0, self::MAX_TERM_LENGTH);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Databases when truncating usually emit a warning or straight up fail (depending on how strict they are).

I believe we should do the same if we're gonna do that.

Side note: I was considering if moving this truncation here might a BC break, but I feel it's not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original issue was about preventing fails: ezsystems/ezpublish-kernel#1858 therefore I think logging might be better solution, added.

@barw4 barw4 requested a review from Steveb-p April 14, 2026 12:24
@barw4 barw4 force-pushed the ibx-11441-textblock-index-limit branch from 8453d37 to 10cdaf6 Compare April 14, 2026 12:25
Comment on lines +66 to +67
calls:
- [setLogger, ['@logger']]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done by either autoconfigure: true or explicit monolog tagging with channel AFAIR.

Copy link
Copy Markdown
Contributor Author

@barw4 barw4 Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if I'm wrong but since RemoteIdentifierMapper and MultipleStringMapper extend StringMapper, they inherit its constructor. Without autowiring in the YAML, the container couldn't inject $logger into those child classes, causing them (and StringMapper itself) to fail silently — removing all string-type mappers from the registry, so I would say we need setLogger here?

@sonarqubecloud
Copy link
Copy Markdown

@barw4 barw4 requested a review from alongosz April 15, 2026 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Something isn't working Ready for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants