java - TextIO.Read GCS folders into pipeline with past 30 days date as name
I want to read a rolling window of the past 30 days into a pipeline. For example, on Jan 15, 2017, I want to read:
> gs://bucket/20170115/*
> gs://bucket/20170114/*
> .
> .
> .
> gs://bucket/20161216/*
The documentation says the glob patterns "*", "?" and "[..]" are supported.
I am trying to avoid doing 30 TextIO.Read steps and then flattening the PCollections into one, because that causes hot shards in the pipeline.
When reading files from GCS, TextIO supports the same wildcard patterns as GCS, described here: Wildcard Names.
In the answer to the linked question, bullet #2 suggests forming a small number of globs that together represent the full range:
For example, the 2-character range "23 through 67" is:
2[3-9]
plus [3-5][0-9]
plus 6[0-7]
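As a quick sanity check, the three globs above can be tested against every two-digit name from "00" to "99". This is only a sketch: it uses Java's built-in `java.nio` glob matcher as a stand-in for GCS wildcard matching, on the assumption that bracket ranges behave the same way in both. The class and method names are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GlobRangeCheck {
    // The three globs from the example, meant to cover "23" through "67".
    static final List<String> GLOBS = Arrays.asList("2[3-9]", "[3-5][0-9]", "6[0-7]");

    // True if the name matches at least one of the globs.
    static boolean matchesAny(String name) {
        for (String glob : GLOBS) {
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
            if (matcher.matches(Paths.get(name))) {
                return true;
            }
        }
        return false;
    }

    // All two-digit names "00".."99" matched by the globs, in ascending order.
    static List<String> matched() {
        List<String> out = new ArrayList<>();
        for (int n = 0; n <= 99; n++) {
            String name = String.format("%02d", n);
            if (matchesAny(name)) {
                out.add(name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> m = matched();
        // Prints the first and last matched names and how many matched.
        System.out.println(m.get(0) + " .. " + m.get(m.size() - 1) + " (" + m.size() + " names)");
    }
}
```

The same idea carries over to date folders: a range such as 20161216 through 20170115 could probably be split per month and per tens digit of the day, at the cost of a handful of globs instead of one.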
TextIO has a new API, readAll(), that allows you to specify the input files dynamically as data. This allows you to pass in the exact set of filenames you need:
    private static List<String> generate30DayFileGlobs(DateTime now) {
        // ..
    }

    public static void main() {
        Pipeline p = // ..
        p.apply(Create.<String>of(generate30DayFileGlobs(DateTime.now())))
         .apply(TextIO.readAll());
        // ..
    }
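The body of the glob-generating helper is left out above. One possible way to fill it in is sketched below, with some assumptions: it uses `java.time.LocalDate` rather than the `DateTime` type from the snippet, it takes the bucket name and the number of days as parameters, and the class and method names are hypothetical. The folder format `yyyyMMdd` is taken from the paths in the question.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class RollingWindowGlobs {
    // Folder names in the question look like 20170115, i.e. yyyyMMdd.
    private static final DateTimeFormatter FOLDER = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Returns one glob per day, newest first, starting at 'today' and
    // going back 'days' folders in total (today counts as the first).
    static List<String> generateFileGlobs(String bucket, LocalDate today, int days) {
        List<String> globs = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            String folder = today.minusDays(i).format(FOLDER);
            globs.add("gs://" + bucket + "/" + folder + "/*");
        }
        return globs;
    }

    public static void main(String[] args) {
        // 31 folders reproduce the listing in the question:
        // 20170115 down to 20161216 inclusive.
        for (String glob : generateFileGlobs("bucket", LocalDate.of(2017, 1, 15), 31)) {
            System.out.println(glob);
        }
    }
}
```

Note that the listing in the question (20170115 down to 20161216) spans 31 folders when both ends are counted, so pass 30 or 31 depending on which window you actually want. The returned list is what would go into `Create.of(...)` ahead of `TextIO.readAll()`.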
The new TextIO.readAll() API has not yet been released, but you can use it from master by specifying the Beam artifact version 2.2.0-SNAPSHOT. The 2.2.0 release is in progress and should be available sometime in September.