Merged
43 commits
7c63ebb
ignore .idea, target
lfoppiano Dec 16, 2025
3f584e5
add pom.xml, Readme.md and the data files
lfoppiano Dec 16, 2025
f9d929e
add makefile
lfoppiano Dec 16, 2025
7e7b3f5
add read warc
lfoppiano Dec 16, 2025
a998133
add CI + spotless
lfoppiano Dec 16, 2025
c808d8c
add figures, editorconfig, .gitignore from the python repository brother
lfoppiano Dec 16, 2025
fa3f707
remove unclear make install, remove venv info from readme
lfoppiano Dec 16, 2025
f6d62bb
update read class, add recompress,
lfoppiano Dec 17, 2025
4aa252a
cleanup, removing the rest of the python stuff for task 0,1,2
lfoppiano Dec 17, 2025
5b018e9
fix missing make install
lfoppiano Dec 18, 2025
817862c
move data under 'data' directory
lfoppiano Dec 18, 2025
620ebee
add Apache header in the code
lfoppiano Dec 18, 2025
886ff0b
make sure we build before running
lfoppiano Dec 18, 2025
6180fce
update .gitignore
lfoppiano Dec 19, 2025
d35e3d8
Implement WARC compression validation for Task 5
lfoppiano Dec 20, 2025
e20c81e
Ignore gzip validation if is uncompressed
lfoppiano Dec 20, 2025
07c9f8b
Merge branch 'main' into luca/feature/part2
lfoppiano Dec 22, 2025
0fa930e
fix compression check, update Readme.md
lfoppiano Dec 22, 2025
78fbac6
add missing apache licence
lfoppiano Dec 22, 2025
6f97782
add commons-compress library
lfoppiano Dec 22, 2025
52fca8c
place Github Actions in the correct directory
lfoppiano Dec 22, 2025
75af0e1
Add CDJX indexer using unreleased JARC code
lfoppiano Dec 23, 2025
077f904
Implement Task 3 and 4
lfoppiano Dec 28, 2025
3a2791a
fix: CI build
lfoppiano Dec 28, 2025
df257e4
fix: Reformat with spotless
lfoppiano Dec 28, 2025
3ed8d61
fix: Rename class
lfoppiano Dec 29, 2025
b3c7252
feat: task 7
lfoppiano Dec 29, 2025
e55c48a
feat: Task 8, duck DB with local file
lfoppiano Jan 5, 2026
43fc088
chore: Run spotless
lfoppiano Jan 5, 2026
893788e
chore(docu): minor changes
lfoppiano Jan 12, 2026
10dd303
Merge branch 'main' into luca/feature/part4
lfoppiano Jan 16, 2026
04505f7
fix: move stuff in data
lfoppiano Jan 16, 2026
e46fdd1
fix: path to the local paths file
lfoppiano Jan 19, 2026
b0c35f3
feat: add support for local parquet files in Duck application
lfoppiano Jan 19, 2026
ec39f21
feat: add download instructions
lfoppiano Jan 26, 2026
32b954f
chore: minor
lfoppiano Jan 26, 2026
2d73b37
Merge branch 'main' into luca/feature/support-local-path
lfoppiano Feb 11, 2026
0f12c7f
fix: make ifndef indentation
lfoppiano Feb 11, 2026
ce9547c
fix: recursive iteration on the local directory
lfoppiano Feb 11, 2026
5cfc5e9
fix: add crawl parameter in the queries, remove useless concatenation
lfoppiano Feb 11, 2026
358813f
fix: correct instructions for downloading from S3 and http
lfoppiano Feb 11, 2026
4d5b1c5
fix: code formatting
lfoppiano Feb 11, 2026
68ba160
doc: add example of data structure
lfoppiano Feb 12, 2026
6 changes: 6 additions & 0 deletions Makefile
@@ -44,6 +44,12 @@ duck_ccf_local_files: build
@echo "warning! only works on Common Crawl Foundation's development machine"
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="ccf_local_files"

duck_local_files: build
ifndef LOCAL_DIR
$(error LOCAL_DIR is required. Usage: make duck_local_files LOCAL_DIR=/path/to/data)
endif
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="local_files $(LOCAL_DIR)"

duck_cloudfront: build
@echo "warning! this might take 1-10 minutes"
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="cloudfront"
43 changes: 41 additions & 2 deletions README.md
@@ -791,11 +791,50 @@ The program then writes that one record into a local Parquet file, does a second
### Bonus: download a full crawl index and query with DuckDB

If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300-gigabyte index and query it repeatedly. All of the approaches below run the same SQL query and should return the same record (written as a Parquet file). To download the index with the AWS CLI, run:

```shell
mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
```

> [!IMPORTANT]
> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run `make duck_ccf_local_files`

If you don't have access to the AWS CLI, you can instead download the files over HTTPS:

```shell
mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
cd 'crawl=CC-MAIN-2024-22/subset=warc'

wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
gunzip cc-index-table.paths.gz

grep 'subset=warc' cc-index-table.paths | \
awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
xargs -n 2 -P 10 sh -c '
echo "Downloading: $2"
mkdir -p "$(dirname "$2")" &&
wget -O "$2" "$1"
' _

rm cc-index-table.paths
cd -
```
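The `grep`/`awk`/`xargs` pipeline above is dense, so here is a minimal Java sketch of the same path-to-URL mapping step (the class and method names are my own illustration, not part of this repository):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PathsToUrls {
    // For each line of cc-index-table.paths that belongs to the warc subset,
    // build the download URL and keep the original relative path as the local
    // destination -- this mirrors the grep + awk step in the shell pipeline.
    static List<String[]> warcDownloads(List<String> paths) {
        return paths.stream()
                .filter(p -> p.contains("subset=warc"))
                .map(p -> new String[] { "https://data.commoncrawl.org/" + p, p })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/part-00000.gz.parquet",
                "cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=crawldiagnostics/part-00000.gz.parquet");
        for (String[] pair : warcDownloads(sample)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

The actual downloading (the `wget -O "$2" "$1"` part) is still easiest from the shell; the sketch only clarifies how each paths-file line becomes a (URL, local path) pair.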

The structure should be something like this:
```shell
tree my_data
my_data
└── crawl=CC-MAIN-2024-22
    └── subset=warc
        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
```

Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
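Under the hood, Duck quotes each discovered parquet path and joins them into a single SQL source list. A minimal sketch of that string-building step (the `filesList` quoting/joining mirrors Duck.java; the `CREATE VIEW` shape and the `hive_partitioning` flag are my assumptions about the generated SQL, not taken from this diff):

```java
import java.util.List;
import java.util.stream.Collectors;

public class LocalIndexSql {
    // Quote each parquet path and join with commas, the same shape Duck
    // builds from the list returned by getFiles().
    static String filesList(List<String> files) {
        return files.stream()
                .map(f -> "'" + f + "'")
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        String files = filesList(List.of(
                "data/crawl=CC-MAIN-2024-22/subset=warc/part-00000.gz.parquet",
                "data/crawl=CC-MAIN-2024-22/subset=warc/part-00001.gz.parquet"));
        // Assumption: hive_partitioning=true would let DuckDB expose the
        // crawl=... and subset=... directory names as queryable columns.
        String sql = "CREATE VIEW ccindex AS SELECT * FROM read_parquet(["
                + files + "], hive_partitioning=true)";
        System.out.println(sql);
    }
}
```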


@@ -821,7 +860,7 @@ We make more datasets available than just the ones discussed in this Whirlwind T

Common Crawl regularly releases Web Graphs which are graphs describing the structure and connectivity of the web as captured in the crawl releases. We provide two levels of graph: host-level and domain-level. Both are available to download [from our website](https://commoncrawl.org/web-graphs).

-The host-level graph describes links between pages on the web at the level of hostnames (e.g. `en.wikipedia.org`). The domain-level graph aggregates this information in the host-level graph, describing links at the pay-level domain (PLD) level (based on the public suffix list maintained on [publicsuffix.org](publicsuffix.org)). The PLD is the subdomain directly under the top-level domain (TLD): e.g. for `en.wikipedia.org`, the TLD would be `.org` and the PLD would be `wikipedia.org`.
+The host-level graph describes links between pages on the web at the level of hostnames (e.g. `en.wikipedia.org`). The domain-level graph aggregates this information in the host-level graph, describing links at the pay-level domain (PLD) level (based on the public suffix list maintained on [publicsuffix.org](https://publicsuffix.org)). The PLD is the subdomain directly under the top-level domain (TLD): e.g. for `en.wikipedia.org`, the TLD would be `.org` and the PLD would be `wikipedia.org`.
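As a rough illustration of the hostname-to-PLD relationship (not how the Web Graph is actually computed), here is a deliberately naive sketch that keeps only the last two labels. It is correct for `en.wikipedia.org` but wrong for multi-label public suffixes like `co.uk`, which is exactly why the real computation consults the public suffix list:

```java
public class NaivePld {
    // Keep only the last two dot-separated labels of a hostname. Naive on
    // purpose: real PLD extraction must use the public suffix list, since
    // e.g. co.uk is a public suffix and the PLD of www.bbc.co.uk is bbc.co.uk.
    static String naivePld(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(naivePld("en.wikipedia.org")); // wikipedia.org
    }
}
```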

As an example, let's look at the [Web Graph release for March, April and May 2025](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/index.html). This page provides links to download data associated with the host- and domain-level graph for those months. The key files needed to construct the graphs are the files containing the vertices or nodes (the hosts or domains), and the files containing the edges (the links between the hosts/domains). These are currently the top two links in each of the tables.

59 changes: 49 additions & 10 deletions src/main/java/org/commoncrawl/whirlwind/Duck.java
@@ -39,7 +39,7 @@ public class Duck {
private static final DateTimeFormatter TIMESTAMP_FORMATTER = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

public enum Algorithm {
-CCF_LOCAL_FILES("ccf_local_files"), CLOUDFRONT("cloudfront");
+CCF_LOCAL_FILES("ccf_local_files"), CLOUDFRONT("cloudfront"), LOCAL_FILES("local_files");

private final String name;

@@ -114,8 +114,13 @@ public static void printRowAsKvList(ResultSet rs, PrintStream out) throws SQLExc
/**
* Gets the list of parquet files to query based on the algorithm.
*/
-public static List<String> getFiles(Algorithm algo, String crawl) throws IOException {
+public static List<String> getFiles(Algorithm algo, String crawl, String localPrefix) throws IOException {
switch (algo) {
case LOCAL_FILES: {
Path indexPath = Path.of(localPrefix);
return getLocalParquetFiles(indexPath);
}

case CCF_LOCAL_FILES: {
Path indexPath = Path.of("/home/cc-pds/commoncrawl/cc-index/table/cc-main/warc", "crawl=" + crawl,
"subset=warc");
@@ -143,6 +148,23 @@ public static List<String> getFiles(Algorithm algo, String crawl) throws IOExcep
}
}

private static List<String> getLocalParquetFiles(Path indexPath) throws IOException {
if (!Files.isDirectory(indexPath)) {
System.err.println("Directory not found: " + indexPath);
System.exit(1);
}

List<String> files = Files.walk(indexPath).map(Path::toString).filter(string -> string.endsWith(".parquet"))
.collect(Collectors.toList());

if (files.isEmpty()) {
System.err.println("No parquet files found in: " + indexPath);
System.exit(1);
}

return files;
}

private static List<String> getLocalParquetFiles(Path indexPath, String prefix, String crawl) throws IOException {
if (!Files.isDirectory(indexPath)) {
printIndexDownloadAdvice(prefix, crawl);
@@ -190,6 +212,7 @@ private static ResultSet executeWithRetry(Statement stmt, String sql) throws SQL
public static void main(String[] args) {
String crawl = "CC-MAIN-2024-22";
Algorithm algo = Algorithm.CLOUDFRONT;
String localPrefix = "/home/cc-pds/commoncrawl/cc-index/table/cc-main/warc";

if (args.length > 0) {
if ("help".equalsIgnoreCase(args[0]) || "--help".equals(args[0]) || "-h".equals(args[0])) {
@@ -201,20 +224,30 @@ public static void main(String[] args) {
System.out.println("Using algorithm: " + algo.getName());
}

if (algo == Algorithm.LOCAL_FILES) {
if (args.length < 2) {
System.err.println("Error: local_files algorithm requires a directory argument.");
printUsage();
System.exit(1);
}
localPrefix = args[1];
}

try {
-run(algo, crawl);
+run(algo, crawl, localPrefix);
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
printUsage();
System.exit(1);
}
}

-public static void run(Algorithm algo, String crawl) throws IOException, SQLException, InterruptedException {
+public static void run(Algorithm algo, String crawl, String localPrefix)
+        throws IOException, SQLException, InterruptedException {
// Ensure stdout uses UTF-8
PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);

-List<String> files = getFiles(algo, crawl);
+List<String> files = getFiles(algo, crawl, localPrefix);
String filesList = files.stream().map(f -> "'" + f + "'").collect(Collectors.joining(", "));

// Use in-memory DuckDB
@@ -230,15 +263,16 @@

// Count total records
out.printf("Total records for crawl: %s%n", crawl);
-try (ResultSet rs = executeWithRetry(stmt, "SELECT COUNT(*) as cnt FROM ccindex")) {
+try (ResultSet rs = executeWithRetry(stmt,
+        "SELECT COUNT(*) as cnt FROM ccindex " + "WHERE subset = 'warc' AND crawl = '" + crawl + "'")) {
if (rs.next()) {
out.println(rs.getLong("cnt"));
}
}

// Query for our specific row
-String selectQuery = "" + "SELECT * FROM ccindex WHERE subset = 'warc' " + "AND crawl = 'CC-MAIN-2024-22' "
-        + "AND url_host_tld = 'org' " + "AND url_host_registered_domain = 'wikipedia.org' "
+String selectQuery = "SELECT * FROM ccindex WHERE subset = 'warc' AND crawl = '" + crawl + "' "
+        + "AND url_host_tld = 'org' AND url_host_registered_domain = 'wikipedia.org' "
+ "AND url = 'https://an.wikipedia.org/wiki/Escopete'";

out.println("Our one row:");
@@ -305,14 +339,19 @@ private static void printResultSet(ResultSet rs, PrintStream out) throws SQLExce
}

private static void printUsage() {
-System.err.println("Usage: Duck [algorithm]");
+System.err.println("Usage: Duck [algorithm] [local-directory]");
System.err.println();
System.err.println("Query Common Crawl index using DuckDB.");
System.err.println();
System.err.println("Algorithms:");
-System.err.println(" ccf_local_files Use local parquet files from /home/cc-pds/commoncrawl/...");
+System.err.println(" local_files Use local parquet files (from specified local directory)");
+System.err.println(
+" ccf_local_files Use local parquet files (default: /home/cc-pds/commoncrawl/cc-index/table/cc-main/warc)");
System.err.println(" cloudfront Use CloudFront URLs (requires <crawl>.warc.paths.gz file)");
System.err.println();
System.err.println("Arguments:");
System.err.println(" local-directory Local directory prefix for 'local_files' algorithm");
System.err.println();
System.err.println("Options:");
System.err.println(" help, --help, -h Show this help message");
}