java - TextIO.Read GCS folders into pipeline with past 30 days date as name


I want to read a rolling window of the past 30 days into a pipeline. For example, on Jan 15, 2017, I want to read:

> gs://bucket/20170115/*
> gs://bucket/20170114/*
> ...
> gs://bucket/20161216/*

The linked documentation says that the glob patterns "*", "?", and "[..]" are supported.

There is a similar question, but it has no example.

I am trying to avoid doing 30 TextIO.Read steps and flattening the PCollections into one, because that causes hot shards in the pipeline.

When reading files from GCS, TextIO supports the same wildcard patterns as GCS, as described here: Wildcard Names.

In the answer to the question you linked, bullet #2 suggests forming a small number of globs that represent the full range:

For example, the 2-character range "23 through 67" is 2[3-9] plus [3-5][0-9] plus 6[0-7].
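The same idea can be applied to the date folders from the question. The following is only a sketch, not part of the original answer: the class name `DateGlobs` and the method `compactGlobs` are hypothetical, it uses `java.time` from the standard library, and it assumes folders are named `yyyyMMdd`. It emits one glob per run of consecutive dates that share their first seven digits, using a `[a-b]` range on the last digit.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class DateGlobs {
    // Assumption: folders are named yyyyMMdd, as in the question.
    private static final DateTimeFormatter FOLDER = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Collapses the inclusive date range [start, end] into a small list of globs. */
    public static List<String> compactGlobs(String bucket, LocalDate start, LocalDate end) {
        List<String> globs = new ArrayList<>();
        LocalDate day = start;
        while (!day.isAfter(end)) {
            String name = day.format(FOLDER);
            String prefix = name.substring(0, 7); // yyyyMM plus the tens digit of the day
            char first = name.charAt(7);
            char last = first;
            // Extend the run while the following day still shares the same prefix.
            LocalDate next = day.plusDays(1);
            while (!next.isAfter(end) && next.format(FOLDER).startsWith(prefix)) {
                last = next.format(FOLDER).charAt(7);
                next = next.plusDays(1);
            }
            String range = first == last ? String.valueOf(first) : "[" + first + "-" + last + "]";
            globs.add("gs://" + bucket + "/" + prefix + range + "/*");
            day = next;
        }
        return globs;
    }

    public static void main(String[] args) {
        compactGlobs("bucket", LocalDate.of(2016, 12, 16), LocalDate.of(2017, 1, 15))
            .forEach(System.out::println);
    }
}
```

For the window in the question this yields five globs instead of one per day: `2016121[6-9]`, `2016122[0-9]`, `2016123[0-1]`, `2017010[1-9]`, and `2017011[0-5]`, each under `gs://bucket/` and followed by `/*`.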


TextIO also has a new API, readAll(), that allows you to specify the input files dynamically as data. This lets you pass in the exact set of filenames you need:

    private static List<String> generate30DayFileGlobs(DateTime now) {
      // ..
    }

    public static void main() {
      Pipeline p = // ..

      // One element per file glob; readAll() expands each glob and reads the matching files.
      p.apply(Create.<String>of(generate30DayFileGlobs(DateTime.now())))
       .apply(TextIO.readAll());
      // ..
    }
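The body of the glob-generating helper is elided in the answer. One possible way to fill it in is sketched below. This is an assumption, not the answer author's code: it uses `java.time.LocalDate` instead of `DateTime` so that it runs with the standard library alone, the class name `RollingWindowGlobs` and the `days` parameter are hypothetical, and the `gs://bucket/yyyyMMdd/*` layout is taken from the question.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class RollingWindowGlobs {
    // Assumption: folders are named yyyyMMdd, as in the question.
    private static final DateTimeFormatter FOLDER = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Returns one glob per day, starting at 'today' and going back 'days' days in total. */
    public static List<String> generateFileGlobs(LocalDate today, int days) {
        List<String> globs = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            globs.add("gs://bucket/" + today.minusDays(i).format(FOLDER) + "/*");
        }
        return globs;
    }

    public static void main(String[] args) {
        generateFileGlobs(LocalDate.of(2017, 1, 15), 30).forEach(System.out::println);
    }
}
```

With `days = 30` and a start date of Jan 15, 2017, the oldest folder is 20161217. The listing in the question ends at 20161216, which corresponds to 31 folders counted inclusively, so the `days` argument may need adjusting depending on which bound is intended.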

The new TextIO.readAll() API has not yet been released, but you can build from master by specifying the Beam artifact version 2.2.0-SNAPSHOT. The 2.2.0 release is in progress and should be available sometime in September.

