Migrating Splunk to SmartStore with GCP: A Field Guide
After spending way too long troubleshooting a SmartStore migration that "should have been working," I figured I'd document what actually happened and the commands that saved us.
The Setup
We needed to migrate a standalone Splunk indexer from local storage (10TB disk) to SmartStore backed by Google Cloud Storage. The goal was to reduce local disk to 2.5TB by offloading warm/cold buckets to GCS while keeping a 1.8TB local cache.
GCP Configuration
Create the Bucket
The bucket needs to be in the same region as your Compute Engine VM to avoid egress costs and latency:
```bash
gsutil mb -l US-CENTRAL1 gs://your-splunk-smartstore/
```
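It's worth confirming the two actually line up before going further. A quick check, assuming the VM and bucket names used above:
```bash
# Zone of the VM (the region is the zone minus its trailing -a/-b/-c)
gcloud compute instances describe YOUR-VM --zone=YOUR-ZONE --format="value(zone.basename())"

# Bucket location
gsutil ls -L -b gs://your-splunk-smartstore/ | grep -i location
```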
IAM Permissions
SmartStore needs these permissions on the bucket:
- storage.objects.create
- storage.objects.get
- storage.objects.delete
- storage.objects.list
- storage.buckets.get
You can use a predefined role like roles/storage.objectAdmin or create a custom role with least privilege. We created splunkStorageObjectUser with just those permissions.
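If you go the custom-role route, creating it looks roughly like this (role ID and project are placeholders; the permission list is the one above):
```bash
gcloud iam roles create splunkStorageObjectUser \
  --project=YOUR-PROJECT \
  --title="Splunk SmartStore Object User" \
  --permissions=storage.objects.create,storage.objects.get,storage.objects.delete,storage.objects.list,storage.buckets.get
```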
Grant it to your service account:
```bash
gsutil iam ch serviceAccount:YOUR-SA@developer.gserviceaccount.com:projects/YOUR-PROJECT/roles/splunkStorageObjectUser gs://your-splunk-smartstore
```
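To confirm the binding actually landed:
```bash
# The service account should show up under the custom role in the bucket's IAM policy
gsutil iam get gs://your-splunk-smartstore
```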
VM OAuth Scopes (The Thing Everyone Forgets)
GCP has three layers of access control:
1. Service Account - the identity
2. OAuth Scope - the API gate (what APIs the VM can call)
3. IAM Role - the actual permissions
Even with the right IAM role, if your VM has devstorage.read_only scope, writes will fail. Check your scopes:
```bash
gcloud compute instances describe YOUR-VM --zone=YOUR-ZONE --format="yaml(serviceAccounts)"
```
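The scope URLs are buried in the YAML, so a quick filter on the same command makes the storage scope obvious:
```bash
# devstorage.read_only here means uploads will fail;
# you want devstorage.full_control (the storage-full alias)
gcloud compute instances describe YOUR-VM --zone=YOUR-ZONE \
  --format="yaml(serviceAccounts)" | grep devstorage
```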
To change scopes, you must stop the VM:
```bash
gcloud compute instances stop YOUR-VM --zone=YOUR-ZONE

gcloud compute instances set-service-account YOUR-VM \
  --zone=YOUR-ZONE \
  --scopes=storage-full,logging-write,monitoring-write,service-management,service-control,trace

gcloud compute instances start YOUR-VM --zone=YOUR-ZONE
```
Splunk Configuration
indexes.conf - Add the Volume
In $SPLUNK_HOME/etc/system/local/indexes.conf, add:
```ini
[volume:remote_store]
storageType = remote
path = gs://your-splunk-smartstore
```
indexes.conf - Add remotePath to Each Index
Every index you want migrated needs remotePath. Add this line to each index stanza:
```ini
remotePath = volume:remote_store/$_index_name
```
Important: Check ALL indexes.conf files. We had indexes defined in:
- system/local/indexes.conf
- apps/search/local/indexes.conf
- apps/TA-crowdstrike-falcon-event-streams/local/indexes.conf
- apps/duo_splunkapp/local/indexes.conf
- apps/Splunk_TA_Google_Workspace/local/indexes.conf
Use this to find them all:
```bash
find /opt/splunk/etc -name "indexes.conf" -exec grep -l "coldPath" {} \;
```
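To catch stragglers, the same find can be extended to flag files that define index storage but never mention remotePath (a rough check only; it doesn't account for settings inherited from default stanzas):
```bash
# Files with coldPath but no remotePath anywhere in them
find /opt/splunk/etc -name "indexes.conf" -exec grep -l "coldPath" {} \; \
  | xargs grep -L "remotePath"
```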
server.conf - The Critical Part We Missed
This is what cost us 10 days. Without the [cachemanager] stanza, SmartStore uploads data but never evicts it locally.
In $SPLUNK_HOME/etc/system/local/server.conf:
```ini
[cachemanager]
max_cache_size = 1800000
hotlist_recency_secs = 86400
hotlist_bloom_filter_recency_hours = 360
```
- max_cache_size - Maximum local cache in MB (1800000 = 1.8TB)
- hotlist_recency_secs - Buckets accessed within this time won't be evicted (86400 = 24 hours)
- hotlist_bloom_filter_recency_hours - Buckets with data this recent won't be evicted (360 = 15 days)
If you skip this, your data uploads to GCS but local disk never shrinks.
Verification Commands
Test GCS Connectivity
```bash
/opt/splunk/bin/splunk cmd splunkd rfs -- ls volume:remote_store
```
No output + no error = working. Errors mean auth/permission issues.
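If "no output" feels too ambiguous, a blunt cross-check is a manual write from the same VM with gsutil (assuming gsutil is running as the VM's default service account, the same identity Splunk uses here; this is also what exposed our read-only scope problem):
```bash
# Manual write/delete round trip against the SmartStore bucket
echo "smartstore write test" > /tmp/smartstore_write_test.txt
gsutil cp /tmp/smartstore_write_test.txt gs://your-splunk-smartstore/write-test/
gsutil rm gs://your-splunk-smartstore/write-test/smartstore_write_test.txt
```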
Check if Config is Applied
```bash
# Check volume
/opt/splunk/bin/splunk btool indexes list volume:remote_store --debug

# Check specific index has remotePath
/opt/splunk/bin/splunk btool indexes list YOUR_INDEX --debug | grep remotePath

# Check cachemanager settings
/opt/splunk/bin/splunk btool server list cachemanager --debug
```
If btool server list cachemanager returns nothing, your cache settings aren't applied.
Verifying Local vs Remote Data
What's Using Local Disk
```bash
# Top consumers
du -sh /opt/splunk/var/lib/splunk/*/ 2>/dev/null | sort -rh | head -20

# Specific index
du -sh /opt/splunk/var/lib/splunk/YOUR_INDEX/
```
What's in the Remote Bucket
```bash
# Total size
gsutil du -sh gs://your-splunk-smartstore/

# Specific index
gsutil du -sh gs://your-splunk-smartstore/YOUR_INDEX/

# List bucket contents
gsutil ls gs://your-splunk-smartstore/YOUR_INDEX/db/ | head -10
```
Verify a Specific Bucket Exists in Both Places
List local buckets and get the oldest one:
```bash
ls /opt/splunk/var/lib/splunk/YOUR_INDEX/db/ | grep "^db_" | sort -t'_' -k3 -n | head -5
```
Bucket names are formatted: db_<latest_epoch>_<earliest_epoch>_<id>_<guid>
Convert timestamp to human readable:
```bash
date -d @1755734703
```
Search for that bucket in GCS:
```bash
gsutil ls -r gs://your-splunk-smartstore/YOUR_INDEX/db/** 2>/dev/null | grep "BUCKET_ID~"
```
Check Bucket Upload Status
Inside each bucket directory, cachemanager_local.json shows what's cached:
```bash
cat /opt/splunk/var/lib/splunk/YOUR_INDEX/db/BUCKET_DIR/cachemanager_local.json
```
If journal_gz is NOT in the file_types list, rawdata was evicted. If it's there, rawdata is still local.
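You can turn that into a count across every bucket in an index by grepping those cache manifests directly (an alternative to the rawdata-directory count in the next section):
```bash
# Buckets whose cachemanager_local.json still lists the rawdata journal
grep -l "journal_gz" /opt/splunk/var/lib/splunk/YOUR_INDEX/db/*/cachemanager_local.json 2>/dev/null | wc -l
```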
Count Buckets Still Holding Rawdata
```bash
find /opt/splunk/var/lib/splunk/YOUR_INDEX/db/*/rawdata -type d 2>/dev/null | wc -l
```
Size of Rawdata Still Local
```bash
du -shc /opt/splunk/var/lib/splunk/YOUR_INDEX/db/*/rawdata 2>/dev/null
```
Troubleshooting Eviction
Check if Eviction is Running
```bash
grep -i "evict" /opt/splunk/var/log/splunk/splunkd.log | tail -20
```
Good output shows bytes_evicted > 0:
```
Eviction results: count=310, test_count=321, bytes_evicted=2680505613
```
Bad output shows eviction failing:
```
Unable to evict enough data. Evicted size=0 instead of size=1473368064
```
Why Eviction Fails
- No cachemanager config - Most common. Check btool.
- Buckets not uploaded yet - Can't evict what's not in remote storage. Check if buckets exist in GCS.
- Hotlist protection - Recent data is protected. Buckets with data newer than hotlist_bloom_filter_recency_hours won't evict.
- Active searches - Searches keep buckets in cache. If you have searches running every 11 minutes hitting 60 days of data, those buckets stay cached (a quick way to spot these follows below).
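For that last point, a rough way to spot the offenders is to dump scheduled-search cadence and time ranges from savedsearches.conf (btool works there too; which searches matter will depend on your apps):
```bash
# Stanza names, schedules, and how far back each search reaches
/opt/splunk/bin/splunk btool savedsearches list --debug \
  | grep -E "\[|cron_schedule|dispatch.earliest_time"
```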
Check Tracking File
Each index has a tracking file for synced buckets:
```bash
cat /opt/splunk/var/lib/splunk/YOUR_INDEX/db/.buckets_synced_to_remote_storage | wc -l
```
If this is empty but data is in GCS, there's a metadata sync issue. Restart Splunk:
```bash
/opt/splunk/bin/splunk restart
```
Fix File Ownership
If Splunk runs as splunk user but files are owned by root:
```bash
chown -R splunk:splunk /opt/splunk/var/lib/splunk/
```
Understanding What Gets Evicted
SmartStore eviction removes data in stages:
- Rawdata (journal.gz) - Evicted first, bulk of the data
- tsidx, bloomfilter - Kept longer for search performance
- Metadata files - Kept locally
After eviction, a bucket directory still exists but contains only search metadata (~500MB tsidx + ~18MB bloomfilter per bucket). This adds up across hundreds of buckets.
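You can measure that leftover metadata directly; tsidx files sit at the top of each bucket directory:
```bash
# Total tsidx still on local disk for one index after eviction
du -shc /opt/splunk/var/lib/splunk/YOUR_INDEX/db/*/*.tsidx 2>/dev/null | tail -1
```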
Reducing Internal Index Retention
The _internal index can get huge. To reduce retention:
```ini
[_internal]
frozenTimePeriodInSecs = 604800
```
Common values:
- 7 days = 604800
- 14 days = 1209600
- 30 days = 2592000
- 90 days = 7776000
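Any other retention is just days × 86400:
```bash
# 45 days, for example
echo $((45 * 86400))   # 3888000
```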
Monitoring Progress
Watch disk usage over time:
```bash
watch -n 60 'df -h / | grep -v Filesystem'
```
Check eviction activity:
```bash
tail -f /opt/splunk/var/log/splunk/splunkd.log | grep -i evict
```
Lessons Learned
- Always add the cachemanager stanza - Without it, nothing evicts
- Check ALL indexes.conf files - Apps create their own
- VM scopes matter - IAM roles aren't enough on GCP
- Eviction takes time - Don't panic if disk doesn't shrink immediately
- Searches keep data cached - Frequent searches on old data prevent eviction
- The UI shows logical size - Use du for actual local disk usage